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ABSTRACT 

Mapping  scientific  kernels  onto  high  performance  heterogeneous  computers  (HPHC) 
must  comply  with  certain  rules  of  thumb  or  heuristics.  Previous  research  by  Jackson  State 
University’s  (JSU)  HPHC  research  group  has  provided  anecdotal  evidence  illustrating 
some  of  these  rules/heuristics.  The  research  highlighted  by  this  thesis  corroborates  the 
credibility  of  these  rules.  In  particular,  four  versions  (two  pairs)  of  a  floating-point  sparse 
matrix  conjugate  gradient  (CG)  iterative  solver  are  presented.  JSU’s  state-of-the-art 
HPHC  utilizes  general  purpose  processors  (GPP)  and  heterogeneous  computational 
hardware,  in  particular,  a  field  programmable  gate  array  (FPGA),  to  develop  the  CG 
kernels.  The  first  version  of  the  pair  executes  strictly  on  the  GPP  and  the  second  uses  both 
the  GPP  and  FPGA  to  map  the  entire  CG  algorithm  onto  hardware.  For  the  second  pair,  a 
refactored  version  of  CG  is  used,  which  is  statically  analyzed  to  determine  where  the 
most  computationally  expensive  operation  occurs.  This  operation  is  the  sparse  matrix 
vector  multiply  (MVM)  kernel.  Based  on  this  analysis,  the  software  version  of  CG  is 
refactored  to  call  MVM  as  a  subroutine.  An  FPGA  version  of  the  MVM  algorithm  is  also 
developed  and  a  static  analysis  of  that  algorithm  suggests  a  speedup  of  the  MVM  kernel. 
All  four  version  of  CG  are  executed  using  a  specially  designed  set  of  sparse  matrices  and 
the  results  demonstrate  that  adherence  to  the  rules  of  thumb  and  heuristics  when  mapping 
scientific  kernels  onto  a  HPHC  can  lead  to  significant  speedups. 
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CHAPTER  1 
INTRODUCTION 

1.1  History  and  Thesis  Organization 

High  Performance  Heterogeneous  Computers  (HPHC)  have  consistently  delivered 
superior  performance  for  their  targeted  application  as  opposed  to  their  microprocessor 
based  counterpart.  Despite  this  significant  speed-up,  only  a  sparse  number  of  applications 
were  considerable  to  justify  the  design,  development,  and  support  costs  of  an  HPHC  (SRC 
Computers,  LLC  2010).  This  is  why  mainstream  supercomputing  companies  such  as 
Cray  and  SGI  have  had  HPHC  offerings  in  the  past  but  with  fleeting  commercial  success, 
however,  Seymour  R.  Cray’s  startup  company,  SRC  Computers  (SRC  Computers,  LLC 

2010) ,  has  had  some  commercial  success  in  the  HPHC  space  (Jane’s  Information  Group 

201 1) .  This  success  is  partially  attributable  to  their  Carte  compiler,  which  integrates  their 
proprietary  MAP  processors  and  user-specific  applications  written  in  Lortran  and/or  C, 
into  a  single  application  executable  code  (SRC  Computers  Inc.  2014). 

Previously,  most  complex  systems  in  communications,  military,  and  medical 
applications  were  simulated  or  modeled  using  floating-point  data  processing  in  C  or 
MATLAB,  which  is  a  widely  used  program  for  solving  algebraic  and  differential 
equations.  However,  the  lack  of  support  with  held  programmable  gate  array  (LPGA)  tool 
suites  made  floating-point  arithmetic  an  unnappealing  option  for  the  LPGA  designer, 
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thus,  final  implementations  were  nearly  always  performed  using  fixed-point  or  integer 
arithmetic  (Parker  2009).  With  the  recent  advancements  in  FPGA  technologies,  companies 
such  as  Altera  now  offer  single  and  double  precision  floating  intellectual  property  (IP) 
cores  fashioned  specifically  for  high  performance  applications.  Tools  such  as  Altera’ s 
MegaWizard  Plug-In  Manager  (Altera  Corp.  2011)  takes  the  hassle  out  of  custom  IP 
core  manufacturing  by  automatically  generating  optimized  floating-point  IP  cores  within 
the  FPGA  itself.  Such  IPs  have  been  successfully  mapped  onto  HPHCs  (Peay,  Morris, 
and  Abed  2011)  and  floating-point  kernels  targeting  specific  problem  domains  such  as 
molecular  dynamics  have  been  demonstrated  (Scrofano,  Gokhale,  Trouw,  and  Prasanna 
2006;  Herbordt,  Khan,  and  Dean  2009),  as  have  general-purpose  kernels  such  as  linear 
algebra  routines  (Zhuo  and  Prasanna  2005b;  Zhuo  and  Prasanna  2005a;  Wu,  Dou,  Lei, 
Zhou,  Wang,  and  Jiang  2009).  There  has  even  been  some  success  at  mapping  iterative 
solvers  onto  FPGAs  (Roldao  and  Constatinides  2010;  Morris  and  Prasanna  2006;  Morris 
and  Prasanna  2007). 

For  numerical  and  computational  applications,  these  FPGA-based  processors  must 
satisfy  a  variety  of  heuristics  and  rules  of  thumb  to  achieve  a  speedup  compared  with  their 
software  counterparts.  This  thesis  illuminates  the  challenging  computational  mapping 
process  while  simultaneously  showing  that  such  mappings  can  result  in  significant 
speedups.  Its  major  focus  is  to  demonstrate  the  veracity  of  the  concept  “the  three  P's” 
which  expresses  the  crucial  relationship  among  performance,  pipelining,  and  parallelism, 
which  is  perhaps  the  most  important  mapping  consideration  (Morris  and  Abed  2013).  In 
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addition,  portions  of  this  thesis  is  referenced  from  the  paper  which  will  be  submitted  to 
the  International  Conference  on  Engineering  of  Heterogeneous  Systems  and  Algorithms 
(ERSA)  (Hawkins,  Alfarra,  Morris,  and  Abed  ).  This  research  paper  is  organized  into  five 
chapters.  Chapter  1  begins  with  the  introduction  which  identifies  the  problem  and  presents 
the  significance  of  this  research.  Chapter  2  discusses  previous  work  and  literature  that 
proves  pivotal  in  the  area  of  attaining  a  speedup  on  a  HPHC.  Chapter  3  discusses  the  tools, 
technology,  and  methodology  used  in  conducting  this  research.  Chapter  4  presents  the 
results  derived  from  this  research  and  analyzes  the  platforms  performance  levels.  Lastly, 
Chapter  5  concludes  with  future  work  related  to  achieving  speedup  with  HPHCs. 

1 .2  Research  Objective 

The  results  and  conclusions  obtained  from  this  research  may  be  applied  to  various 
HPHC  areas.  By  way  of  a  sparse  matrix  conjugate  gradient  (CG)  iterative  solver, 
this  paper  illustrates  some  of  the  issues  associated  with  mapping  scientific  kernels 
onto  HPHCs.  The  CG  method  was  chosen  based  on  heuristics  developed  from  earlier 
research(Rice,  Abed,  and  Morris  2009;  Morris  and  Abed  2013;  Morris,  McGruder,  and 
Abed  2010).  Furthermore,  CG  is  a  method  that  is  frequently  used  in  the  real  world  and 
complex  enough  to  deal  with  issues  of  mapping  scientific  kernels  onto  FPGAs. 

In  this  research,  four  versions  (two  pairs)  of  CG  were  mapped  on  to  a  HPHC.  The 
first  version  of  the  pair  executes  strictly  on  the  general  purpose  processor  (GPP)  in  JSUs 
state-of-the-art  HPHC,  and  the  second  utilizes  both  the  GPP  and  the  heterogeneous 
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computational  hardware,  in  particular,  a  field  programmable  gate  array  (FPGA).  Not  too 
surprisingly,  the  first  hardware  version  does  not  perform  very  well  because  it  does  not 
consider  the  rules  and  heuristics  that  will  be  discussed  in  this  thesis. 

The  second  pair,  which  is  a  profiled  version  of  CG,  is  statically  analyzed  to  determine 
where  the  most  computationally  expensive  operation  occurs.  This  operation  is  the  sparse 
matrix  vector  multiply  (MVM)  kernel.  Based  on  this  analysis,  the  software  version  of 
CG  is  refactored  to  call  MVM  as  a  subroutine.  This  software  version  is  then  profiled  to 
determine  the  percentage  of  runtime  spent  in  MVM.  This  factor  is  called  the  fraction 
that  can  be  enhanced  (/e)  in  Amdahl’s  Law.  A  hardware  (FPGA)  version  of  the  MVM 
algorithm  is  also  developed  and  a  static  analysis  of  that  algorithm  suggests  a  speedup  of 
the  MVM  kernel.  This  factor  is  called  the  speedup  of  the  enhanced  component  (se)  in 
Amdahl’s  Law. 

The  combination  of  these  two  factors  (/e  and  se)  are  used  in  Amdahl’s  Law  to  estimate 
the  anticipated  overall  speed  (sG)  up  of  the  HPHC  version  of  CG.  The  second  HPHC 
version  performs  significantly  better  than  the  first  because  the  rules  and  heuristics  (to  be 
described  later)  are  followed,  i.e.,  MVM  was  a  great  candidate  for  kernel  mapping  onto  a 
HPHC  to  achieve  an  overall  speedup  of  CG. 
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CHAPTER  2 
LITERATURE  REVIEW 

This  chapter  discusses  the  necessary  background  information  that  serves  as  a  platform 
for  this  research.  This  chapter  is  partitioned  into  7  sections.  The  first  gives  rise  to  a 
particular  type  of  HPHC  as  it  examines  the  history  and  early  mappings  of  FPGAs. 
Section  two  discusses  heterogeneous  computers  (RC)  and  their  physical  architecture. 
Section  three  details  the  history  and  signficance  of  the  conjugate  gradient  algorithm.  The 
fourth  section  discusses  matrix  storage  formats  and  their  crucial  importance  in  scientific 
research.  Section  five  examines  the  types  of  matrices  and  their  importance  as  well.  Section 
six  discusses  in  detail  the  HPHC  target  platform;  the  SRC-7.  The  seventh  and  final  section 
of  this  chapter  details  Amhdal's  Law  which  is  an  approach  for  calculating  the  overall 
speed  up  of  a  system  when  only  part  of  the  system  has  been  enhanced. 

2.1  FPGAs 

2.1.1  History 

Field  programmable  gate  arrays  are  a  programmable  digital  logic  device  (PLD).  It  was 

invented  in  1984  by  Xilinx  co-founder  Ross  Freeman  (Xilinx,  Inc.  2006).  At  the  time  it 

was  groundbreaking  due  to  its  ability  to  be  dynamic  in  application.  However,  it  operated 

at  much  slower  speeds  than  Application-Specific  Integrated  Circuits  (ASICs).  It  also  used 
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more  transistors  than  its  fixed  competitor.  With  the  reduction  in  transistor  cost,  over  time, 
FPGAs  have  become  widespread  in  use. 

2.1.2  Architecture 

The  physical  architecture  of  an  FPGA,  as  shown  in  Figure  2.1,  consists  of  various 
components  that  are  able  to  communicate  via  switches.  These  switches  are  directional  and 
allow  multi-point  interconnection  between  components.  Fixed  logic  blocks  consist  of,  but 
are  not  limited  to,  the  central  processing  unit  (CPU),  random  access  memory  (RAM),  and 
clock.  Input/output  (I/O)  blocks  allow  for  data  to  be  made  available  to  the  FPGA  as  well 
as  provide  output  from  the  application.  Configurable  logic  blocks  (CLBs)  are  structures 
that  are  used  to  emulate  digital  logic  functions. 

As  shown  in  Figure  2.2,  CLBs  consist  of  multiple  slices  that  are  internally  connected 
aside  from  the  FPGA  component  routing  mechanism.  A  deeper  look  into  the  slices  reveals 
the  components  that  allow  for  the  FPGAs  configurable  design.  Slices  are  comprised  of 
multiple  lookup  tables  (LUTs),  flip-flops,  multiplexers,  and  control.  The  CLB  mimics  a 
digital  logic  function  set  in  place  by  the  developer  due  largely  in  part  to  the  lookup  tables 
(LUTs)  that  lie  within  the  slices.  LUTs  directly  resemble  digital  logic  using  memory,  as 
depicted  in  Figure  2.3. 

At  runtime,  the  FPGA  is  programmed  with  a  configuration  bitstream  (the  bitstream 
is  viewed  as  data,  but  it  is  the  driving  force  of  the  conversion  of  the  FPGA).  It  defines 
the  contents  of  LUTs,  interconnections  between  components,  and  input  or  output 
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Figure  2. 1  FPGA  Architecture 

(I/O).  Hence,  a  different  application  implemented  on  an  FPGA  lies  solely  within  the 
corresponding  bitsream.  The  FPGA  is  thereby  dynamic  in  functionality  though  it  is 
limited  in  speed  capability  in  comparison  to  its  ASIC  competitor.  If  a  new  application  is  to 
be  developed,  there  is  no  need  for  “hard  wiring”  as  in  the  historical  process.  Nonetheless, 
it  is  not  to  be  assumed  that  generating  the  corresponding  bitstream  is  a  trivial  task  -  for 
this  now  takes  many  steps  and  design  considerations  by  the  developer. 
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Figure  2.2  CLB  architecture 
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Figure  2.3  Lookup  Table 

2.1.3  FPGA  Benefits  and  Applications 

FPGAs  have  gained  popularity  as  a  platform  for  scientific  computation  applications. 

If  inadequately  designed,  the  platform  may  significantly  decrease  performance  (Zhuo, 
Morris,  and  Prasanna  2007).  Developers  have  also  used  the  FPGAs  to  create  significant 
increases  over  comparable  platforms  in  application-specific  integrated  circuits  (ASICs). 
For  example,  a  1 800x  increase  in  training  speed  has  been  obtained  with  Rankboost  web 
search  engines  over  its  prior  implementation  using  only  a  GPP  (Xu,  Cai,  Gao,  Zhang,  and 
Hsu  2009).  A  5.5  fold  speedup  in  pair-wise  computations  was  actualized  by  Nallamuthu 
et.al  using  the  heterogeneous  computing  environment.  FPGAs  provide  the  potential  for 
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high  performance  with  less  power  consumption.  Under  the  aforementioned  development, 
a  nine  fold  increase  in  the  ratio  of  power  to  performance  also  gave  an  added  benefit  to 
using  a  heterogeneous  environment  involving  an  FPGA  architecture  (Nallamuthu,  Smith, 
Hampton,  Agarwal,  and  Alam  2010). 

2.2  Heterogeneous  Computers 

Heterogeneous  computing  is  a  technology  that  encapsulates  fixed  and  variable 
structures.  It  allows  the  developer  to  modify  the  architecture  of  a  processing  element. 
Unlike  fixed  hardware,  the  variable  hardware  is  able  to  be  adapted  to  a  new  platform  in 
real  time.  The  RC  concept  was  invented  in  1960  by  Gerald  Estrin.  At  the  time,  research 
and  development  efforts  proved  impractical.  Developers  would  have  to  manually  place 
modules  and  wire  them  according  to  the  anticipated  configuration.  This  lengthy  approach 
was  highly  susceptible  to  human  errors,  and  circuits  would  be  limited  in  size.  Presently, 
the  manual  process  has  been  replaced  with  digital  control  switches  that  allow  the  variable 
structure  to  be  configured  through  programming.  Contemporary  high  performance 
heterogeneous  computing  (HPHC)  clusters  vary  with  its  models  and  architectures.  It  may 
consist  of  multiple  nodes  of  RCs  to  create  parallelism  in  computation  and  task  execution. 

2.2.1  Application  mapping 

An  application  may  be  realized  solely  in  software  running  on  a  GPP.  This  process 
usually  involves  code  written  in  a  high-level  language  that  is  compiled  and  linked  with 
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system  libraries  to  produce  a  binary  executable.  On  a  GPP  with  a  single  processor,  each 
task  is  executed  sequentially.  Here,  the  developer’s  primary  avenue  of  attaining  a  speedup 
is  through  optimizing  the  code.  With  an  application  placed  on  a  heterogeneous  computer, 
the  fixed  component  generally  calls  for  a  GPP,  and  the  variable  component  may  be  an 
FPGA.  The  application  is  developed  and  placed  on  an  RC  if  deemed  more  profitable  in 
terms  of  speedup.  This  concept  is  illustrated  in  Figure  2.4.  It  is  important  to  analyze  the 
potential  of  a  computationally  intensive  kernel  that  is  to  be  successfully  placed  on  an 
FPGA.  The  developer  strategically  evaluates  design  considerations  of  the  application. 

This  is  determined  through  various  metrics  that  analyze  its  application  size,  algorithm,  and 
memory-access  patterns  (Rice,  Abed,  and  Morris  2009). 

In  order  to  profile  the  potential  of  an  computational  kernel  that  is  to  be  placed  onto 
the  FPGA,  the  developer  takes  various  design  elements  into  consideration.  In  terms  of 
runtime,  the  developer  must  analyze  the  kernel  and  determine  its  overall  effect  on  the 
entire  application.  Its  particular  algorithm  is  evaluated  as  the  developer  must  examine 
the  kernel  size  in  relation  to  space  available  on  the  FPGA.  Concerning  the  entire  RC 
layout,  one  must  also  consider  memory  access  patterns.  The  bandwidth  available  for 
communication  between  the  fixed  and  variable  elements  should  be  capable  to  transfer  data 
without  creating  a  significant  increase  in  the  overall  runtime.  If  so,  some  of  this  time 
lost  in  transfer  among  elements  may  be  alleviated  through  data  reuse  once  it  is  placed 
onto  the  variable  computing  environment.  This  research  considers  a  dot  product  for  its 


computational  kernel. 
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Software  Only  Application 


Software  and  Hardware  Application 


Figure  2.4  Application  Mapping 


2.3  Conjugate  Gradient 

The  Conjugate  Gradient  (CG)  method,  which  was  discovered  in  1952  by  Hestens 
and  Stiefel,  is  an  algorithm  typically  used  to  solve  large  and  sparse  systems  of  linear 
equations,  Ax  =  b.  The  method  was  abandoned  in  the  1960s  due  to  a  lack  of  finite 
precision  computational  engines  which  caused  the  algorithm  to  diverge,  thus,  rendering 
it  useless.  In  1971,  however,  mathematician  J.K.  Reid  indicated  that  CG,  if  observed 
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as  an  iterative  method  rather  than  an  n  step  process,  will  ultimately  converge  with 
veritable  accuracy  (Paige  and  Saunders  1975).  Today,  the  Conjugate  Gradient  method  is 
acknowledged  as  one  of  the  top  10  algorithms  of  the  twentieth  century  (Van  der  Vorst 
2000).  It  is  the  de  facto  standard  algorithm  for  solving  equations  that  involve  symmetric 
positive  definite  (SPD)  matrices  (Morris  2013). 


2.3.1  CG  algorithm 


This  section  discuss  conjugate  gradient  as  an  iterative  method  to  solve  systems  of 
linear  equations 
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where  A  e  Mnxn  is  a  symmetric  ( A  =  AT)  positive  definite  (Vx  ^  0,  x7  Ax  >  0)  matrix, 
x  G  M"  is  the  unknown  vector  (our  desired  solution),  and  b  e  K"  is  the  constant  vector. 

It  can  also  be  shown  that  the  solution  Ax  =  b  is  the  same  as  the  x  vector  that 
minimizes  the  quadrtic  function, 
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i.e.,  V/(x)  =  0  =  Ax  =  b,  which  is  also  a  linear  combination  of  A-orthogonal(conjugate) 
vectors  {pi,  P2,  •  •  •  ,  p„}  that  forms  a  basis  for  Mn. 

Ax  =  b  can  also  be  expressed  in  Equation  2.2  as 

n 

x  =  5>P<-  (2.2) 

i= 1 

Finding  the  A-orthogonal  vectors  is  a  non-trivial  task;  therefore,  an  iterative  approach 
is  used.  In  particular,  it  can  be  shown  that  each  of  these  vectors  correspond  to  a  search 
direction,  therefore,  the  iterative  approach  is  in  effect  a  gradient  descent  method  that 
minimizes  the  quadratic  function  previously  mentioned.  After  each  step  in  the  given 
direction  there  is  a  residual,  r^,  which  in  a  sense  indicates  how  far  off  the  solution  is 
(Morris  2013).  It  can  be  shown  that  these  search  directions  are  in  a  direction  negative 
of  the  gradient  at  the  starting  point,  hence,  the  term  “ conjugate  gradient" .  Think  of 
“zig-zagging”  to  the  bottom  of  a  bowl.  Performing  this  type  of  gradient  descent  with 
infinite  precision,  the  iteratively  generated  set  of  A-orthogonal  vectors  would  correspond 
to  a  basis  for  Mn  as  depicted  in  Figure  2.5. 

Via  some  algebraic  manipulations  this  approach  can  be  refined  to  a  recurisve 
technique,  however,  one  of  the  issues  with  this  technique  is  that  the  residual  contains 
components  in  the  search  direction  that  have  already  been  taken.  Remember,  the  next 
search  direction  must  be  A-orthogonal.  It  is  necessary  to  remove  these  components  from 
the  residual  in  order  to  calculate  a  new  search  direction  via  the  projection  operater  (3 

which  can  be  observed  as  =  — T — - .  Before  discussing  the  final  algorithm,  one  detail 

rl-irk-i 
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1:  algorithm  primitiveCG(A,  x,  b) 

2:  Using  current  point,  calculate  residual,  rk  •<—  b  —  Axfc 

3:  Calculate  next  conjugate  vector,  p/,:+i  <—  rk  —  E  AfcPi 

i<k 

4:  Descend  to  the  next  point  xfc+1  xfc  +  a;fe+1pfc+i 

5:  If  not  the  minimum  point,  goto  2 

6:  end  algorithm 

Figure  2.5  Primitive  CG  algorithm 

that  needs  to  be  mentioned  is  the  classical  residual  norm.  This  is  a  standard  approach  used 
in  iterative  methods  such  as  CG,  Jacobi,  Newton-Raphson,  etc. 

After  a  further  improvements,  an  optimized  version  of  CG  is  realized  in  Figure  2.6. 
With  this  approach,  an  arbitrary  starting  point  x0  was  selected  which  then  descended.  The 
first  A-orthogonal  vector,  pi,  is  a  search  direction  opposite  to  the  gradient  as  depicted 
in  line  2,  therefore,  the  initial  search  direction  is  also  the  same  as  the  initial  residual  as 
illustrated  on  line  3.  The  next  line  is  a  criteria  for  forcing  the  algorithm  to  enter  the  loop. 
Line  5,  as  mentioned  previously,  is  part  of  the  equation  for  calculating  the  residual.  Rather 
than  continually  computing  this  value,  it  is  simply  calculated  one  time.  For  line  6,  the 
body  of  the  loop  requires  the  dot  product  of  the  current  residual,  rk,  and  the  previous 
residual,  r^-i.  Instead  of  calculating  two  dot  products  within  the  loop,  the  dot  product 
from  the  previous  iteration  was  reused,  i.e.,  rj r0  is  a  dot  product.  Line  7  creates  the 
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i :  algorithm  optimizedCGCA  x.  b) 

2:  pi  <—  b  —  tIxo 

3:  r0  <r-  pi 

4:  A  <—  €  +  1 

5:  ovcr/jnorrn  <  l/||b|| 

6:  rT r0id  <r-  r/'rn 

7:  k  <-  1 

8:  while  (A  >  e)  .and.  ( k  <  fcmax)  do 

9:  Vap  <-  Apk 

10:  ak  <—  rTr0\d/pJvap 

11:  xfc  <r-  Xfe-x  +akpk 

12:  rk  <r-  rA:_i  -  akvap 

13:  rTYnew  r^’rfc 

14:  0k<-rTr 

new/ 1  '7V0ld 

15:  rTr0\a  4  vTv new 

16:  pfc+l  4-  rA:  +  fikpk 

17:  A  ■<  |  rA  ||  overftnorm 

18:  k~ 

19:  end  while 

20:  return  (xA;_i) 

21:  end  algorithm 


Figure  2.6  Optimized  CG  algorithm 

iteration  index,  k,  which  is  also  part  of  the  criteria  for  the  while  loop  to  stop  infinite 
looping. 

It  can  be  shown  that  the  matrix  vector  product  is  needed  twice  within  CG,  however, 
it  is  computed  once  on  line  9  because  it  is  an  expensive  O  (n2)  algorithm.  The  next 
line  computes  the  step  size,  ak.  along  the  direction  of  the  CG.  For  line  11,  the  new 
approximation  for  x  is  calculated  by  descending  in  the  conjugate  search  direction  of  the 
step  size,  i.e.,  x/,._|  +  akpk.  Line  12,  recursively  computes  the  new  residual.  Then  the 
dot  product  of  the  new  residual,  rTrnew  is  calculated,  which  will  be  used  in  subsequent 
computations.  Line  14  computes  the  projection  operator  /3  which  removes  from  the 
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residual  all  previous  search  directions.  Since  the  dot  product  of  the  previous  residual  is  no 
longer  needed  and  to  prevent  calculating  the  dot  product  at  the  next  iteration,  line  15 
retains  the  current  residual  for  the  next  iteration.  The  new  search  direction  can  finally  be 
calculated,  by  using  the  projection  operator  to  remove  the  residual  from  all  components 
along  the  previous  conjugate  search  directions,  i.e.,  rfc  +  ftkPk-  Line  17  computes  the 
residual  norm  to  check  the  algorithm  for  convergence.  Line  1 8  simply  increments  the 
iteration  index.  If  the  algorithm  has  converged,  the  solution,  xfc_1?  is  returned. 

This  optimized  version  will  become  the  blueprint  for  the  four  versions  of  CG  as 
discussed  in  Chapter  3. 

2.4  Sparse  Matrix  Storage  Formats 

According  to  J.  Dongarra  (Dongerra  2000),  the  efficiency  of  most  iterative  methods, 
such  as  CG,  can  be  attributed  to  the  performance  of  the  matrix- vector  multiply  which 
correlates  directly  to  the  storage  scheme  used  for  the  matrix.  The  reasoning  behind  these 
formats  begins  with  the  concept  “sparse  matrix”.  A  matrix  is  considered  sparse  if  there 
are  more  zero  than  non-zero  elements,  in  which  case,  it  is  more  efficient  to  store  only 
the  non-zero  elements  because  memory  consumption  can  be  reduced  and  performance 
can  be  increased  by  using  a  specialized  sparse  storage  representation  (B.  Jacobs  2013). 
Sparse  storage  schemes  allocate  contiguous  storage  in  memory  for  the  nonzero  elements, 
which  in  turn  requires  a  scheme  for  knowing  where  the  elements  fit  into  the  full  matrix 
(Dongerra  2000),  i.e.,  storing  all  non-zero  elements  of  the  matrix  into  a  linear  array  and 
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providing  auxiliary  arrays  to  describe  the  locations  of  the  non-zero  elements  in  the 
original  matrix  (D.  Dodson  and  Lewis  1991).  Coordinate  format  (COO)  and  Compressed 
Sparse  Row  (CSR)  are  the  primary  sparse  matrix  representations  used  throughout  this 
research  and  shall  be  discussed  in  depth. 

2.4.1  Coordinate  Format 

Also  known  as  the  ‘ijv’  or  ‘triplet’  format,  COO  form  is  the  simplest  method  to 
construct  a  matrix.  It  allows  for  very  fast  conversion  to  and  from  CSR  format  for 
arithmetic  and  matrix  vector  operations  (Scipy  2009).  COO  format  uses  the  vectors  row, 
col,  and  val  to  store  the  row  indices,  the  column  indices,  and  values  of  the  matrix, 
respectfully.  For  example,  a  5  x  5  matrix  is  depicted  in  Fig.  2.7. 

7  0  0  2  0 

0  3  0  0  5 

2  0  6  1  1 

0  4  7  0  0 

9  3  0  0  4 

Figure  2.7  Simple  5x5  matrix 

This  simple  matrix  can  be  easily  converted  to  COO  format  as  shown  in  Fig.  2.8.  Given 
an  index,  i,  that  walks  each  element  of  the  three  vectors  Row,  Column,  Value,  one  can 
easily  reconstruct  the  original  matrix. 
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row 


1122333344555 


col 


1415134523125 


val 


7235261147934 


Figure  2.8  Example  of  COO  format 
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2.4.2  Compressed  Sparse  Row 

If  a  matrix  has  a  small  number  of  nonzero  values,  nz  <A  n2,  where  n  is  the  matrix 
order  and  nz  is  the  number  of  nonzeros,  CSR  format  (Saad  2009)  can  reduce  storage 
and  computational  requirements.  A  real  matrix  with  n  =  1,  024  and  nz  =  4,  096,  which 
requires  4MB  in  dense  format,  can  be  stored  in  CSR  format  in  about  36KB.  Furthermore, 
needless  computations  with  zero  values  can  be  avoided.  For  a  real  matrix  of  order  n  with 
nz  nonzeros,  CSR  uses  vectors  val,  col,  and  ptr  to  store  only  the  nonzero  values  and  to 
identify  the  associated  row  and  column  indices.  Vector  val  is  a  real  vector  of  length  nz 
containing  the  nonzero  values  obtained  from  a  row-wise  traversal  of  the  matrix.  The  col 
integer  vector  is  also  of  length  nz  and  contains  the  column  index  of  each  nonzero  value, 
i.e.,  (vali,  =  ctij )  =4*  (coin  =  j )  ■  The  ptr  integer  vector  is  of  length  n  +  1  and  contains  the 
index  in  val  where  each  matrix  row  starts.  For  example,  the  first  nonzero  element  of 
matrix  row  m  is  found  at  index  ptrm  of  val.  By  convention,  ptrn+]  =  nz  +  1.  Notice  that 
(atj)  =>-  {ptrt  <  j  <  ptri+i)  for  all  i.  An  example,  depicting  the  CSR  representation  of  an 
order  n  —  4  sparse  matrix,  is  shown  in  Fig.  2.9.  Consider  ptr:i  =  5;  this  indicates  that  row 
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8  0  2 
0  5  0 
0  0  6 
0  4  0 

Figure  2.9  Example  of  CSR  format 

3  of  the  matrix  begins  at  index  5  of  val  and  col.  Notice  that  val 5  =  6  and  that  co/5  =  3. 
Thus,  a3> 3  =  6  as  in  the  dense  representation  of  the  matrix. 

2.4.3  Aligned  CSR  Format 

For  the  CG  processor  described  in  this  paper,  exactly  k  —  8  values  are  read  from  val 
and  col  during  each  clock  cycle.  This  is  accomplished  by  striping  these  vectors  across 
multiple  memory  banks.  To  avoid  large  multiplexers  and  the  associated  degradation  in 
performance,  the  matrix  rows  are  simply  padded  with  zeros  (as  needed)  to  ensure  they 
have  an  exact  multiple  of  k  values.  The  ptr  vector  is  modified  to  express  indices  in  terms 
of  k- sized  groups  of  data,  and  the  term  knz  (analogous  to  nz  )  is  the  number  of  A-groups. 
This  entire  process,  which  is  carried  out  by  the  software  module  when  it  marshals  the  data 
for  the  FPGA-based  processor,  is  known  as  “/c-alignment.”  This  (  -alignment  process 
produces  scalar  knz  and  vectors  kval,  kcol,  and  kptr.  As  before,  kptrn+ 1  =  knz  +  1. 

Perhaps  the  idea  is  better  understood  by  way  of  an  example.  If  one  assumes  that  k  =  2, 
then  the  sparse  matrix  represented  in  Fig.  2.9  could  be  represented  in  k- aligned  CSR 
format  as  shown  in  Fig.  2.10.  Each  bracketed  k- group  of  data  represents  the  contents  at  a 
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given  index  across  a  striped  data  set.  Consider  kptr 4  =  5;  this  indicates  that  matrix  row  4 
begins  at  index  5  of  striped  memory  banks  kval  and  kcol.  Notice  kval 5  =  [4,  7]  (2  banks, 
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Figure  2.10  Example  of  aligned  CSR  format  (k  =  2,  knz  =  5) 

2  values)  and  that  kcol5  =  [2, 4]  (2  banks,  2  indices).  Thus,  as  in  the  dense  matrix  of 
Fig.  2.9,  «4)2  =  4  and  a4i4  =  7. 

2.5  Test  Matrices 

During  this  research  it  became  increasingly  apparent  that  the  complexity  of  the  matrix 
itself  played  a  significant  role  when  analyzing  the  number  of  iterations  and  convergence 
rate.  Conditioning  sparse  matrices  into  a  form  more  suitable  for  numerical  solution,  or 
preconditioning  (Axelsson  1996),  resulted  in  rapidly  converging  solutions  using  the 
software  version  of  CG,  thereby  rendering  the  speedup  which  had  hoped  to  be  obtained 
via  the  fully  pipelined  and  parallelized  hardware  version  counterproductive.  Though 
preconditioning  is  the  key  to  a  successful  iterative  solver  such  as  conjugate  gradient  (Song 
),  the  following  section  will  discuss  the  ill-conditioned(non-preconditioned)  matrices 
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employed  that  greatly  challenge  CG  in  addition  to  obtaining  a  significant  speedup  via  a 
HPHC. 

2.5.1  UFL  Sparse  Matrix  Collection 

The  University  of  Florida  Sparse  Matrix  Collection,  managed  by  Dr.  Tim  Davis  and 
Dr.  Yifan  Hu,  is  a  large  and  actively  growing  set  of  sparse  matrices  widely  used  by  the 
numerical  linear  algebra  community  for  the  development  and  performance  evaluation 
of  sparse  matrix  algorithms  such  as  conjugate  gradient.  The  collection  covers  a  vast 
spectrum  of  domains  ranging  from  mathematics  and  physics,  to  civil  engineering  and 
computer  science.  With  over  2547  matrices,  UFL’s  Sparse  Matrix  Collection  boasts 
its  largest  matrix  as  having  a  dimension  of  1 18  million  with  almost  2  billion  nonzero 
entries  (Davis  and  Hu  2011).  UFL  provides  a  Java  program(UFgui)  for  browsing  and 
downloading  matrices  in  Matrix  Market  format  and  the  complexity  of  the  matrix  itself  can 
be  manipulated  by  the  user.  Included  with  this  software  is  an  option  for  rendering  images 
that  correlate  to  the  matrix  generated  by  the  user. 

2.6  SRC-7 

2.6.1  Target  HPHC 

The  Jackson  State  University  research  group's  HPHC,  SRC-7,  has  two  3.0GHz  Intel 
Xeon  processors  with  a  16K  LI  cache,  2MB  L2  cache,  and  6GB  RAM.  The  MAP  Series 
H  processor  contains  two  Altera  EP2S180F1508C3  FPGAs  running  at  150MHz.  It  has 
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Figure  2.11  SRC-7  HPHC  architecture 


two  1GB  banks  of  global  common  memory  (GCM),  which  are  used  to  marshal  large  data 
sets  from  the  Xeon  processor  to  the  MAP  processor  and  to  return  large  data  sets  from  the 
MAP  to  the  Xeon.  The  Xeons  access  GCM  via  API  calls,  and  the  FPGAs  access  the  GCM 
via  DMA.  GCM  is  the  primary  mechanism  by  which  the  GPP  and  FPGA  interchange 
large  data  sets,  i.e.,  GCM  is  used  for  bulk  data  transfers  (as  opposed  to  individual  array 
elements).  GCM  can  be  accessed  by  the  GPP  via  API  calls  and  by  the  FPGA  via  direct 
memory  access  (DMA). 

There  are  16  banks  of  on  on-board  memory  (OBM)  associated  with  the  FPGAs.  Each 
bank  contains  over  500K  64-bit  words  and  can  be  separately  addressed.  This  allows  the 
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MAP  processor  to  simultaneously  access  up  to  16  words  per  clock  cycle.  OBM  is  the 
primary  mechanism  by  which  large  data  sets  can  be  striped  for  parallel  consumption  by 
the  FPGA,  i.e.,  up  to  16  values  can  be  read  from  OBM  each  FPGA  clock  cycle.  DMA  can 
be  used  to  move  data  between  the  OBM  and  Xeon  memory,  but  the  use  of  GCM  is  more 
efficient.  The  MAP  processor  is  connected  to  the  Xeon  GPP  motherboard  through  SRC’s 
proprietary  SNAP  D  interface,  which  is  mapped  into  the  Xeon  memory  subsystem. 

The  MAP  processor  has  a  streaming  DMA  capability  and  an  inter/intra-FPGA 
streaming  capability  that  allows  computation  to  overlap  communication  and  facilitates 
parallelism  within  the  MAP  processor.  Multiple  configurable  single-port  BRAMs  are 
located  within  the  FPGA  fabric.  The  BRAMs  are  typically  used  to  hold  arrays  that  are 
accessed  serially.  A  single  value  can  be  read  from  each  BRAM  array  during  a  given  clock 
cycle. 

2.6.2  HLL  development  flow 

Vendors  such  as  Mentor  Graphics  and  Altera  have  created  compilers,  DK  Design  Suite 
(Mentor  Graphics  2010)  and  Carte  C  (SRC  Computers  Inc.  2014),  respectively,  that  takes 
the  hassle  out  employing  HDL-based  hardware  design.  These  HLL-to-HDL  compilers 
allow  scientists  and  engineers  to  program  HPHCs  using  HLLs  like  C  or  Fortran,  which 
in  the  past,  was  limited  to  only  using  complex  HDLs.  HLL-to-HDL  compilers  support 
enhanced  language  features  such  as  pipelined  loops,  parallel  code  blocks,  synchronization 
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primitives,  and  intellectual  property  interface  (IPI)  calls  to  access  vendor  or  user  IP  cores. 
Figure  2.12  depicts  an  HLL  development  flow  for  HPHC-based  applications. 


software  ,,  hardware 


Figure  2.12  HLL  development  flow 
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The  application  is  partitioned  into  software  modules,  which  are  targeted  for  execution 
on  the  GPP(s),  and  hardware  modules,  which  are  targeted  for  execution  on  the  FPGA(s). 
Developers  can  also  integrate  custom  HDL-based  IP  cores  into  their  design.  During 
development,  the  HLL  software  module  is  compiled  with  a  standard  HLL  compiler  to 
produce  object  files.  The  linker  uses  the  object  files  (including  the  hardware  module 
runtime  interface  code  emitted  by  the  HLL-to-HDL  compiler),  library  files,  and  the 
configuration  bitstream  (treated  as  data  by  the  linker)  to  produce  a  binary  executable. 

The  enhanced  HLL  hardware  module  is  ingested  by  the  HLL-to-HDL  compiler,  which 
emits  HDL  and  the  runtime  interface  code  previously  mentioned.  The  HDL  is  used  by  the 
synthesis  tool  to  produce  a  netlist.  The  netlist,  which  topologically  details  the  low-level 
device  instances  and  connections  needed  to  implement  the  logic  circuit,  is  fed  into  the 
place  and  route  tool.  As  suggested  by  the  name,  the  place  and  route  tool  determines  an 
optimal  way  to  implement  an  instance  of  the  netlist  subject  to  a  set  of  constraints,  e.g., 
area,  timing. 

The  output  of  place  and  route  is  fed  to  the  bitstream  generation  tool  to  produce  an 
FPGA  configuration  bitstream.  The  bitstream,  when  loaded  onto  the  FPGA,  configures 
the  programmable  logic,  interconnection  network,  and  I/O  of  the  FPGA  such  that  it 
implements  the  logic  design  representing  the  computational  kernel.  From  the  viewpoint  of 
the  software  module,  as  suggested  by  the  header  file  block  at  the  top  of  figure  2.12,  the 
hardware  module  “looks  like”  a  parameterized  subroutine  call.  From  the  viewpoint  of 


the  hardware  module  HLL  code,  IP  cores  also  look  like  subroutine  calls  via  the  IPI.  At 
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execution  time,  the  configuration  bitstream  is  extracted  from  the  binary  executable  and 
loaded  onto  the  FPGA.  The  G  PP  then  executes  the  machine  code  instructions  and  “calls” 
the  FPGA-based  kernel  as  needed. 


2.7  Amdahl’s  Law 


Gene  Amdahl,  a  Norwegian  American  computer  architect  who  has  worked  extensively 
on  mainframe  computing  at  IBM  and  later  his  own  companies,  is  best  known  for 
formulating  Amdahl’s  law,  which  states  a  fundamental  limitation  of  parallel  computing 
(IEEE  2014).  Amdahl’s  law  or  Amdahl’s  argument,  is  used  to  find  the  maximum  expected 
improvement  to  an  overall  system  when  only  part  of  the  system  is  improved.  It  is  often 
used  in  parallel  computing  to  predict  the  theoretical  maximum  speedup  using  multiple 
processors,  however,  Amdahl’s  law  also  encompasses  the  domain  of  RC  systems  as  well 
(Amdahl  1967). 

Amdahl’s  law  is  a  model  for  the  relationship  between  the  expected  speedup  of 
parallelized  implementations  of  an  algorithm  relative  to  the  serial  algorithm,  under  the 
assumption  that  the  problem  size  remains  the  same  when  parallelized.  Equation  2.3  gives 
the  formula  for  Amdahl’s  law. 


1 

1  ~  fe  +  fe/ Se 


(2.3) 


where  sQ  is  the  overall  speedup,  fe  is  the  fraction  of  the  system  to  be  enhanced,  and  se  is 
the  speedup  of  the  portion  to  be  enhanced.  For  example,  if  75  percent,  i.e.,  the  fraction  to 
be  enhanced,  of  an  algorithm  can  be  parallelized,  and  25  percent  is  sequential,  according 
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to  Amdahl’s  law,  the  overall  speedup  which  can  be  achieved  on  three  processors,  i.e.,  the 
portion  to  be  enhanced,  is  l/((l-.75)  +  (.75/3))  =  2.  This  means  that  the  overall  speedup, 
s0,  the  program  can  gain  is  twice  as  fast  on  three  processors  than  on  one.  Amdahl’s  law  is 
used  within  this  research  to  statically  analyze  the  overall  speedup  of  the  monolithic  and 
tuned  versions  of  CG.  The  details  are  discussed  in  Chapter  3. 
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CHAPTER  3 
METHODOLOGY 


3.1  Overview 

High  performance  heterogeneous  computers  that  employ  field  programmable  gate 
arrays  (FPGAs)  as  computational  elements  are  known  as  high  performance  heterogeneous 
computers  (HPHCs).  For  floating-point  applications,  these  FPGA-based  processors  must 
satisfy  a  variety  of  heuristics  and  rules  of  thumb  to  achieve  a  speedup  compared  with  their 
software  counterparts.  By  way  of  a  sparse  matrix  conjugate  gradient  iterative  solver,  the 
following  sections  illustrates  the  methods  associated  with  mapping  floating-point  kernels 
onto  HPHCs. 

It  is  worthwhile  looking  at  some  of  the  rules  of  thumb  and  heuristics  used  to  determine 
if  CG  is  a  good  mapping  candidate.  These  have  been  developed  over  the  past  several 
years  based  on  Jackson  State  University’s  HPHC-oriented  research  and  are  described  in 
some  detail  in  (Rice,  Pace,  Gates,  Morris,  and  Abed  2008;  Justin  L.  Rice  and  Morris 
2009).  Considerations  include:  (a)  the  three  p's,  (b)  expected  resource  utilization, 

(c)  control/memory  vs.  compute  intensive,  (d)  monolithicity,  (e)  available  bandwidth, 

(f)  opportunities  for  data  reuse,  (g)  algorithm  design  stability,  (h)  algorithm  efficiency, 
and  (i)  memory  access  patterns.  Section  3.2  gives  more  details  about  these  design 
considerations  relative  to  CG. 
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There  are  currently  two  pairs  (monolithic  and  “tuned”)  of  the  floating-point  sparse 
matrix  CG  iterative  solver  in  reference  with  this  research.  The  two  monolothic  versions 
blindly  mapped  the  entire  CG  algorithm.  The  first  version  executes  strictly  on  software, 
i.e.,  the  GPP.  The  second  utilizes  both  the  GPP  and  the  SRC-7  HPHC.  Section  3.4  details 
the  monolothic  versions  while  also  giving  an  overview  of  the  pitfalls  associated  with  CG 
when  the  rules  and  heurisitcs  of  mapping  floating-point  kernels  onto  HPHCs  are  not 
considered.  The  “tuned”  pair  are  refactored  versions  of  CG.  CG  was  statically  analyzed  to 
determine  where  the  most  computational  expensive  operations  occur,  which  is  MVM. 
Through  the  use  of  a  system  call,  CG  was  redeveloped  to  initiate  MVM  as  a  seperate 
subroutine.  Section  3.5  discusses  both  the  software  and  hardware  versions  of  the  profiled 
CG  algorithm,  a  detailed  discussion  of  MVM,  and  the  analytics  used  for  the  theoretical 
overall  speedup  via  Amdahl’s  Law. 

3.2  The  JSU-team  rules  and  heuristics 

3.2.1  The  three  p’s 

The  three  p’s,  which  expresses  the  crucial  relationship  among  performance,  pipelining, 
and  parallelism,  is  perhaps  the  most  important  mapping  consideration  (Morris  and  Abed 
2013).  The  multiplicative  effect  of  both  pipelining  and  parallelizing  is  suggested  by 
Eq.  3.1: 


performance  oc  pipelined  x  parallelized 


(3.1) 
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Given  the  order-of-magnitude  clock  speed  advantage  of  GPPs  over  FPGAs,  a  candidate 
FPGA  module  must  be  able  to  be  both  pipelined  and  parallelized  if  a  performance 
improvement  is  to  be  realized. 

3.2.2  Expected  resource  utilization 

Another  important  consideration  is  the  expected  FPGA  resource  utilization  of  the 
candidate  module.  Since  floating-point  IP  cores  can  be  quite  large,  the  developer  needs  to 
determine  if  the  candidate  will  even  fit  on  the  FPGA.  The  developer  also  needs  to  consider 
the  needed  local  memory  capacity,  number  of  simultaneous  memory  accesses,  anticipated 
clock  rate  in  the  light  of  complex  routing,  etc.  In  the  case  of  the  CG  solver,  the  limiting 
factor  was  the  amount  of  BRAM  in  the  FPGA  fabric,  with  the  number  of  OBM  banks 
running  a  close  second.  Despite  these  limitation  an  8-wide  parallel  CG  data  path  capable 
of  handling  sparse  matrices  up  to  order,  n  =  8K,  with  up  to  approximately  4M  nonzero 
entries  was  constructed. 

3.2.3  Control/memory  vs.  compute  intensive 

It  is  also  imperative  to  consider  whether  an  algorithm  is  control/memory  intensive 
or  compute  intensive.  The  control  aspect  is  similar  to  the  branching  problem  in  a  GPP, 
and  the  memory  aspect  is  similar  to  a  GPP  where  accessing  memory  data  takes  a 
considerable  amount  of  time  compared  with  arithmetic  operations.  Harkins,  et  al., 
(Harkins,  El-Ghazawi,  El-Araby,  and  Huang  2005)  illustrate  the  importance  of  this 
concept  when  they  show  that  sorting  algorithms  do  not  perform  very  well  on  an  HPHC. 
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The  CG  processor  employs  design  features  that  minimize  the  number  of  control  clauses, 
e.g.,  the  use  of  A;- aligned  CSR  format.  Furthermore,  unlike  a  GPP  memory  hierarchy,  the 
HPHC  memory  organization  does  not  penalize  irregular  memory  accesses  (Abed  and 
Morris  2009).  Therefore,  on  an  FPGA,  CG  appears  to  be  primarily  compute  intensive. 

3.2.4  Monolithicity 

If  the  candidate  FPGA  module  contains  procedure  calls  (is  not  monolithic),  then 
these  calls  have  to  be  inlined  or  the  module  cannot  be  considered  as  a  viable  candidate 
(hardware  cannot  “call”  hardware).  Obviously,  the  ability  to  inline  is  impacted  by  the 
available  FPGA  resources.  In  the  case  of  the  CG  processor,  the  Carte  compiler  provided 
IPI  “calls”  for  some  of  the  lower  level  “routines”  such  as  square  root.  Therefore,  the  first 
CG  kernel  pair  was  effectively  monolithic  and  suitable  for  mapping  onto  an  FPGA.  In 
addition,  a  second  refactored  pair  of  CG  kernels  had  a  software  subroutine  that  called 
MVM  which  is  also  monolithic. 

3.2.5  Available  bandwidth 

The  GPP  to  FPGA  bandwidth  also  deserves  attention.  Obviously,  the  FPGA  memory 
access  and  processing  time  should  be  less  than  the  GPP  memory  access  and  processing 
time.  According  to  Herbordt,  et  al.,  (Herbordt,  Court,  Gu,  Sukhwani,  Conti,  Model, 
and  DiSabello  2007),  when  they  discuss  latency  hiding,  a  design  should  try  to  overlap 
computation  with  communication.  In  the  CG  processor  a  reduction  in  the  bandwidth 
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requirement  was  acheived  by  marshaling  the  data  into  GCM  banks  and  then  using  DMA 
to  load  the  OBM  and  BRAM  memory  used  during  the  FPGA-based  computation. 

3.2.6  Opportunities  for  data  reuse 

Algorithms  that  have  a  significant  potential  for  data  reuse  may  be  suitable  FPGA 
module  candidates.  In  the  case  of  the  CG  iterative  solver,  the  matrix  A  and  vector  b  are 
reused  during  every  iteration.  This  had  the  effect  of  amortizing  the  transfer  costs  across  all 
iterations. 

3.2.7  Algorithm  design  stability 

Since  mapping  an  algorithm  to  an  FPGA  is  not  the  easiest  of  tasks,  it  is  imperative  to 
make  sure  that  the  algorithm  is  as  stable  as  possible.  If  the  algorithm  is  altered  while  in 
the  midst  of  a  hardware  implementation  process,  one  could  easily  discover  that  the  new 
algorithm  no  longer  fits  onto  the  FPGA,  or  that  it  can  no  longer  deliver  on  the  promised 
speedup.  The  CG  iterative  solver  has  been  around  for  over  six  decades;  obviously  the 
authors  did  not  anticipate  any  algorithm  design  modifications. 

3.2.8  Algorithm  efficiency 

Another  application  design  consideration  is  to  make  sure  an  efficient  algorithm  is 
employed.  For  example,  Cramer’s  rule,  which  has  exponential  complexity,  0((n  +  1)!), 
might  run  faster  if  implemented  on  an  FPGA.  However,  Gaussian  elimination,  with 
complexity,  0(n3),  is  a  much  more  efficient  algorithm.  In  this  case,  one  would  use  a 
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software  solution  rather  than  map  the  inefficient  algorithm  onto  an  FPGA.  In  the  case  of 
the  CG  algorithm,  Section  2.3  identifies  CG  as  being  one  of  the  most  efficient  algorithms 
for  solving  large  sparse  systems  of  linear  equations. 

3.2.9  Memory  access  patterns 

Another  design  consideration  is  the  memory  access  pattern.  Unlike  a  GPP  memory 
hierarchy,  the  HPHC  memory  organization  does  not  penalize  irregular  memory  accesses 
(Abed  and  Morris  2009).  Since  the  CG  processor  uses  a  sparse  matrix,  it  clearly 
demonstrates  an  irregular  memory  access  pattern  and  is  a  good  mapping  candidate. 

3.3  High-level  design 

3.3.1  CG 

The  high-level  design  for  the  CG  solver  is  shown  in  Fig.  3.1.  It  consists  of  four  major 
components:  a  main  routine  and  matrix  support  libraries;  several  strictly  diagonally 
dominant  sparse  matrices,  Ai . . .  Am:  the  software  or  hardware  (FPGA-based)  CG  solver; 
and  the  output  result  and  statistics  files,  xx . . .  x,m  and  ©x . . .  ©m.  The  bt  vectors  are 
shown  as  inputs,  but  for  the  experiments  in  this  research  they  are  generated  from  a  known 
x  vector  at  runtime.  The  main  routine  is  a  driver  program,  which  essentially  measures 
how  long  it  takes  for  CG  to  solve  each  set  of  equations.  The  coordinate-format  matrices 
are  read  in  using  the  Matrix  Market  I/O  library  (NIST  2004)  and  converted  to  CSR  format 
using  Saad’s  SPARSKIT  library  (Saad  2009).  The  software  CG  kernel  implementation 


50 


35 


Figure  3.1  High-level  CG  design 

is  based  on  the  algorithm  shown  in  Figure  2.6,  and  the  FPGA-based  CG  kernel  will  be 
described  in  Section  3.4. 

A  compile-time  decision  selects  either  the  software  or  FPGA-based  version  of  CG.  At 
runtime,  main  reads  in  each  coordinate-format  matrix,  converts  it  to  CSR  format,  and  uses 
a  known  x  vector  to  generate  b.  It  then  invokes  the  selected  CG  kernel  sending  matrix,  A 
(val,  col,  and  ptr),  starting  point  x:0),  and  constant  vector  b.  After  convergence,  CG 
stores  the  result  and  returns.  The  main  routine  writes  the  solution  to  the  results  file;  it  also 
writes  the  input  matrix  name,  number  of  iterations,  and  wall  clock  execution  time  to  the 
statistics  file  and  then  terminates. 

3.3.2  MVM 

To  recap,  there  were  four  CG  algorithms  that  were  mapped  onto  the  HPHC.  The  first 
pair  was  a  monolithic  version  which  comprised  of  a  software  and  hardware  version.  The 
second  pair  was  ’’tuned”  and  also  had  a  software  and  hardware  version,  however,  this 
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kernel  involved  a  subroutine  call  that  implemented  MVM;  software  or  hardware.  In  other 
words,  the  high-level  design  for  the  ’’tuned”  CG  solver  is  very  similar  to  Figure  3.1, 
however,  in  this  kernel  CG  invokes  MVM  to  either  compute  the  software  or  hardware 
version  as  illustrated  in  Figure  3.2. 


software  FPGA 


Figure  3.2  High-level  MVM  design 


3.4  Monolithic  version 

3.4.1  Software 

As  stated  in  Section  1,  our  research  began  by  blindly  mapping  the  entire  CG  algorithm 
onto  hardware  without  any  regard  to  the  heuristics  that  previous  research  has  shown  to 
highly  impact  performance.  The  monolithic  CG  algorithm  that  was  executed  in  software, 
i.e.,  the  GPP,  is  essentially  the  algorithm  depicted  in  Figure  2.6. 
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3.4.2  Hardware 


The  monolithic  hardware  version  of  CG  is  shown  in  Figure  3.3. 
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44: 
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46: 
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48: 
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algorithm  CGH  W(kval,  kcol,  kptr.  b,  b2,  n,  knz.  x) 

parBegin 

BUF  DM Aqcm  i  :Obm  (kval,  stripe-8) 
BUFJ)MAGCm2:obm  (kcol,  stripe-8) 

parEnd 

//  only  two  GCM  banks  so 

parBegin 

STREAMJDMAgcmgbram  (kptr) 
STREAM_DMAgcm2:bram  (b) 

parEnd 

//  parallel  DMA  limited  to  2 

for  /  in  [l,n]do 

//  set  up  for  repeat  loop 

x,  4-  0 

//  x  =  0 

Pi 4-  r i  4-  b, 

//  therefore  p  =  b 

M  AC(r,.  r*.  rTrold) 

end  for 

//  calculate  the  dot  product 

d  4-  0 

repeat 

//  loop  index 

pi  4-  ...  4—  pll  4-  p 

//  1 1  copies  of  p  vector 

parBegin 

//  compute  v  =  Ap 

Pl: 

for  i  in  [l,fcn2]do 

a  4—  (al  •  •  •  a8)  stripe-8  from  kvali 
j  4—  ( j  1  ■  •  ■  j8)  stripe-8  from  kcol, 

P  4—  {plji  ■  ■  ■  P&j8) 

Vfifo  4-  dotSTree  (a,  p) 
end  for 

//  feed  values  stream 

P2: 

for  i  in  [l,n]  do 

Gfifo  4—  kptri+i  —  kptri 

end  for 

//  feed  counts  stream 

P3: 

SfIFO  4“  STREAM  (  FfII-'O  •  CpiFo) 

//  streaming  accumulator 

p4: 

fori  in  [l,n]do 

Vi  4—  .S’fifo 

MAC(vj,  p9,,  pTv) 

end  for 
parEnd 

//  the  dotN’s 

a  4—  rTrold/pTv 
for  i  in  [l,n]do 

//  step  size 

x,  4-  Xi  4-  oplO, 

//  next  point 

r j  4—  r;  —  avj 

//  residual 

MAC(r,;,  r,,  rTrnew) 
end  for 

//  new  dot  product 

/?  4—  rTrnew  / rTrold 

//  projection  operator 

rTrold  4—  rTrnew 
fori  in  [1,  n]  do 

//  for  next  iteration 

p,  4-  r;  -F  ySplli 

end  for 

//  new  search  direction 

r2b2  4-  y/lrTrold) b2 

//  residual  norm 

d  4 —  d  1 

until  ( r2b2  <  e).OR.  (d  >  dmax) 

//  next  iteration 

STREAM.DMABram:GCM2  (x) 

//  all  done,  stream  x  back  to  GCM 

end  algorithm 
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Figure  3.3  monolithic  hardware  version  of  CG 
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The  loops  on  line  35  and  42  are  serialized  loops,  therefore,  they  cannot  be  done  in 
parallel  with  the  dot  product,  i.e.,  line  37.  Consequently,  these  serial  loops  will  incur 
costly  computational  penalties  that  significantly  decreases  the  speedup.  A  block  diagram 
is  also  presented  here  to  further  illustrate  the  algorithm. 


Figure  3.4  CG  hardware  process 
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3.5  “Tuned”  version 


3.5.1  Software 

This  algorithm  is  essentially  the  same  as  the  discussion  held  in  the  previous  section, 
however,  this  software  features  a  subroutine  call  to  MVM  as  shown  in  Figure  3.5.  A  static 

l:  algorithm  CGSW( Ax. b) 

2:  pi  <—  b  —  Axo 

3:  r0  4—  Pi 

4:  A  <-  £  +  1 

5:  over^norm  ^  l/||b|| 

6:  rTrold  r?r0 

7:  k  <-  1 

8:  while  (A  >  e)  .and.  ( k  <  km ax)  do 

9:  vap  <r-  mvm(a.  p) 

10:  ak  <-  rTrold/pl vap 

ll:  xk  <r-  xfc_i  +  akPk 

12:  rfc  rfc_i  -  akvap 

13:  rTrncw  <-  r^rfc 

14:  f3k  <r-  rTrncw/rTrM 

15:  rTr0id  fTv  new 

•6:  Pfc+l  <-  rfc  +  /5fcPfc 

17:  A  «—  ||rfc||  •  over6n0rm 

18:  k ++ 

19:  end  while 

20:  return  (Xfc_i) 

21:  end  algorithm 


Figure  3.5  tuned  software  version  of  CG 


analysis  of  CG  showed  that  the  matrix- vector  multiply,  which  is  0(n2),  was  the  most 
expensive  part  of  the  kernel.  Therefore,  CG  was  refactored  such  that  matrix-vector 
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multiply  was  a  separate  routine.  Subsequently,  gprof  was  used  to  profile  the  software 
version  of  the  code.  Not  surprisingly,  matrix  vector  multiply  consumed  nearly  67 
percent  of  the  runtime, therefore,  according  to  Amdahl’s  law,  the  fraction  to  be  enhanced 
fe  =  0.67. 

3.5.2  Hardware 

The  tuned  hardware  CG  algorithm  is  exactly  as  previously  discussed  ,  except  a  call  to 
MVM  is  a  call  to  hardware,  as  shown  in  Figure  3.6.  The  objective  of  mapping  algorithms 
onto  HPHCs  is  to  obtain  a  speedup  relative  to  the  performance  on  a  GPP.  Anticipated 
overall  speedup  can  be  quantified  via  Amdahl’s  Law  (J.  L.  Hennessy  and  D.  A.  Patterson 
2003) 

1 

1  ~  fe  +  fe / Se  ’ 

where  sQ  is  the  overall  speedup,  fe  is  the  fraction  of  the  system  to  be  enhanced,  and  se  is 
the  speedup  of  the  portion  to  be  enhanced. 

The  SRC-7  MAPstation  used  as  the  target  platform  has  two  Intel  Xeon  processors 
and  two  Altera  FPGAs  running  at  3.0GHz  and  150MHz,  respectively.  In  the  absence  of 
parallelization  or  pipelining,  the  FPGA  runs  20x  slower  than  the  GPP,  i.e.: 

150  Mhz  1 

Se  “  3000 Mhz  ~  20 
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29 

30 

31 


algorithm  MvMHW(kval,  kcol,  kptr,  v,  p,  *  first) 

if  (*first)  then  //  grab  the  A  matrix 

parBegin  //  only  two  GCM  banks  so 

BUF  DMAqcMi  :OBM  (kval,  stripe-8) 

BUF_DMAGCm2:obm  (kcol,  stripe-8) 
parEnd 

parBegin  //  parallel  DMA  limited  to  2 

STREAM_DMAGcmi:bram  (kptr) 

STREAM  DMAgcm2:BRam  (p) 

parEnd 


else 

STREAM_DMAGcm2:BRam  (p) 

end  if 

pi  4—  ...  4—  p8  4—  p 

parBegin 

Pi  = 

fori  in  [1, knz]  do 

a  4—  (al  •  •  •  a8)  stripe-8  from  kvaf 
j  4—  ( j  1  •  ■  •  j 8)  stripe-8  from  kcolt 
P< —  (plji  •••p8j8) 

Vfifo  <-  dot8Tree  (a,  p) 
end  for 


//  matrix  A  is  already  in  memory 
//  still  need  p 


//  compute  v  =  Ap 
//  feed  values  stream 


P2: 

fori  in  [l,n]do 

Cfifo  kptr i+i  —  kptri 

end  for 

P3: 

5'fifo  <—  5Zstream(^/fifo>  Cfifo) 
Pi: 

for  /  in  [l,n]do 
Vi  <—  ■S'fifo 

end  for 
parEnd 

STREAM _DMAbram:Gcm2  (v) 

end  algorithm 


//  feed  counts  stream 

//  streaming  accumulator 
//  the  dotN’s 


//  all  done,  return  results 


Figure  3.6  tuned  hardware  version  of  CG 


The  width  of  the  dot  product  tree  is  8  and  the  depth  of  the  pipeline  is  approximately  50 
(recall  that  the  speed-up  of  a  pipeline  for  a  large  number  of  items  is  the  same  as  the  depth). 
Thus,  parallization  and  pipeling  results  in 


s, 


20 
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There  are  some  serial  operations  and  latencies  that  cannot  be  hidden  during  computation, 
so  a  conserative  value  is  se  ~  10.  Therefore,  according  to  Amdahl’s  Law,  an  overall 
speedup  sa  =  1/(0. 33  +  0.67/10)  =  2.5  is  anticipated. 
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CHAPTER  4 

RESULTS,  ANALYSIS,  AND  COMPARISON 


4.1  Results 

4.1.1  Monolithic  results 

As  shown  in  table  4.1,  a  monolithic  approach  might  not  produce  a  speedup.  In  our 
example,  it  produced  a  slowdown  as  depicted  by  the  average  of  six  samples  of  matrices. 
This  is  to  be  expected  since  the  monolithic  version  of  CG  disregarded  the  rules  and 
heuristics  when  mapping  algorithms  onto  HPHCs. 


Table  4.1  Slowdown  using  monolithic  method 


Matrix  Name 

Size  (NZ) 

Gw  1/iS) 

Gw  (/^s) 

B  Ol 

fassl.128.mtx 

438 

66381 

258797 

3.89 

fass2.500.mtx 

2298 

55247 

435985 

7.89 

fassl.1024.mtx 

4918 

396983 

286116 

0.72 

fassl.2048.mtx 

10038 

74304 

300623 

4 

fass4.4096.mtx 

26270 

398214 

3759980 

9.4 

fassl.8192.mtx 

40758 

150452 

490218 

3.25 

Avg.  slowdown 

4.85 

44 
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4.1.2  Tuned  results 

In  contrast,  following  the  heuristics,  discussed  in  Section  3.2,  it  is  possible  to  achieve  a 
speedup.  Due  to  technical  difficulties,  our  target  platform  is  undergoing  repairs  at  the 
vendor’s  site,  therefore,  it  is  not  possible  to  show  the  results  of  the  refactored  CG  hardware 
mapping.  Nonetheless,  previous  research  mapping  scientific  kernels  such  as  Transactions 
on  Parallel  and  Distributed  Systems’  (TPDS)  paper  “Mapping  a  Jacobi  Iterative  Solver 
onto  a  High  Performance  Heterogeneous  Computer”  resulted  in  a  3-fold  increase  in 
performance  as  illustrated  in  Fig  4.1. 


Execution  Time 


-^-^-^roforococoGJ-{^4^-{^cncncna)CDC^~vj-vj~vJCoc»co 

^l^^^l^'oj^'Kj'co^'ro'oj^'ro'co^'ro'co^'ro'oj^'Kj'co 


n  (trial  #) 


Figure  4. 1  Run  time  comparison 


The  refactored  CG  algorithm  that  calls  the  hardware  MVM  is  in  close  agreement  with 
these  speedups  based  upon  the  static  analysis  in  Section  3.5. 
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CHAPTER  5 

CONCLUSIONS  AND  FUTURE  WORK 

The  research  thus  far  has  suggested  strong  evidence  of  an  overall  speed  up  of 
conjugate  gradient  (CG)  when  the  rules  and  heuristics  developed  by  the  Jackson  State 
University’s  High  Performance  Heterogeneous  Computers  (HPHC)  research  group  are 
adhered  to.  The  negative  impact  of  performance  has  also  been  illustrated  when  these  rules 
and  heuristics  are  ignored  (Hawkins,  Alfarra,  Morris,  and  Abed  ).  Finding  the  “sweet 
spot”  when  mapping  scientific  kernels  onto  HPHCs  has  indeed  helped  in  solidifying  the 
veracity  of  the  University’s  extensive  research  within  the  HPHC  research  domain  (Rice, 
Abed,  and  Morris  2009;  Rice,  Pace,  Gates,  Morris,  and  Abed  2008;  Justin  L.  Rice  and 
Morris  2009;  Morris,  Silas,  and  Abed  2012;  Morris  2006).  Due  to  our  target  platform 
being  upgraded  from  ARO  funds  at  this  current  time,  further  progress  in  the  research  had 
to  be  postponed  until  the  HPHC  returns  from  the  vendor.  Future  work  includes  finishing 
the  hardware  implementation  of  matrix  vector  multiplication  (MVM)  and  calculating  the 
actual  overall  speedup  of  CG. 
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CHAPTER  6 
APPENDIX 
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APPENDIX  A 

APPLICATION  DEVELOPMENT  FILES 

A.l  Monolithic  Conjugate  Gradient  Development 

A.  1.1  main.c 

1  /  *  main  .  c 

2  sparse  matrix  jacobi  iterative  solver  for  SRC— 7  hardware 

3  Gerald  R.  Morris,  gerald.r.morris@us.army.mil,  1  Mar  10 

4  refactored  by  Gerald  R.  Morris  ,  7  Apr  2011  to  use  a 

5  residual  norm— based  convergence  test 

6  Refactored  by  Jamory  D.  Hawkins  and  Anas  Alfarra  ,  Dec  2013 

7  to  use  conjugate  gradient  as  the  iteration  method 

8 

9  There  are  actually  two  options  when  building  this  code 

10  The  options  are  selected  via  defines  in  the  eg .  h  header  file 

11  #define  SWVER  or  #define  HWVER 

12  The  first  is  the  software  version  which  employs  the  standard 

13  nested  loop  approach  to  do  the  calculation  serially. 

14  The  second  is  the  k— aligned  hardware  version  ,  which  employ  loops 

15  wherein  the  matrix  is  dealt  with  in  chunks  of  k  elements  at  a  time  . 

16  k— alignment  requires  an  exact  multiple  of  k  elements  in  a  row 

17  The  k— alignment  process  simply  pads  the  matrix  with  0’s 

18  */ 

19  #include  <stdio.  h> 

20  #include  <stdlib.h> 

21  #include  <sys/time.h> 

22  #include  <string.  h> 
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23 

#include  <stdint. h> 

24 

#include  <math  .  h> 

25 

#include  ”cg . h” 

26 

#  include  ’’mmio.h” 

27 

#include  ’’sparsekit.h” 

28 

#  include  ” usee . h” 

29 

# if def  HWVER 

30 

#include  <libmap.h> 

31 

#endif 

32 

33 

//  define  this  when  we  want  a 

verbose  mode  (mostly  debug  stuff) 

34 

//#  define  VURBOZE 

35 

36 

int  main  (int  arge  ,  char  *argv 

[]) 

{ 

37 

38 

//  matrices  are  stored  in 

MatrixMarket  coordinate  format 

39 

//  we  use  the  MatrixMarket 

mmio  (I/O)  routines 

40 

char  fname  [255]; 

// 

MatrixMarket  file 

41 

FILE  *fmm; 

// 

MatrixMarket  file  pointer 

42 

MMLtypecode  matcode  ; 

// 

MatrixMarket  meta— data 

43 

int  n ; 

// 

#  matrix  rows 

44 

int  nc ; 

// 

#  matrix  columns 

45 

i n  t  nnz ; 

// 

#  non— zero  elements 

46 

float  *acoo  ; 

// 

matrix  values 

47 

i n t  *  icoo  ; 

// 

matrix  row  indices 

48 

int  *jcoo ; 

// 

matrix  column  indices 

49 

50 

//  eg  uses  compressed  sparse 

row  format 

51 

float  * val  ; 

// 

CSR  sparse  matrix  (input  matrix) 

52 

int  *  col  ; 

// 

CSR  column  index 

53 

int  *  ptr ; 

// 

CSR  row  pointer 

54 
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float  *x ; 

// 

x  vector 

56 

float  *b ; 

// 

constant  vector 

57 

float  b2n_l  ; 

// 

(2  — norm  of  b  vector  )"(  —  1) 

58 

float  r2nb2n  ; 

// 

ratio  of  2— norms  (convergence  test) 

59 

60 

char  result [ 1024] ; 

// 

capture  results 

61 

i  n  t  iters; 

// 

number  of  iterations  needed 

62 

FILE  *  fres  ; 

// 

results  file  pointer 

63 

64 

double  tO  ,  tl  ,  t2  ,  t3  ; 

// 

to  capture  wall  clock  run  time 

65 

i  n  t  i  ,  j  ,  k  ; 

// 

loop  indices 

66 

67 

# if def  HWVER  //  if  it  ’  s 

the 

hardware  version  ,  we  need  these 

68 

//  for  k— aligned  CSR 

data 

69 

float  *kval  ; 

// 

CSR  matrix  k— aligned  values 

70 

i n t  * kcol ; 

// 

CSR  k— aligned  column  index 

71 

i  n  t  *  kptr  ; 

// 

CSR  k— aligned  row  pointer 

72 

i n  t  knz ; 

// 

k— aligned  #  of  non— zeros 

73 

int  kvalsz  ; 

// 

allocation  size  of  kval  (bytes) 

74 

int  kcolsz ; 

// 

allocation  size  of  kcol 

75 

int  kptrsz  ; 

// 

allocation  size  of  kptr 

76 

77 

float  *D_1  ; 

// 

DA(  —  1)  (1/aii  as  a  vector) 

78 

int  D_lsz  ,  bsz  ,  xsz  ; 

// 

allocation  size  of  vectors 

79 

80 

//  we’ll  allocate  segments  using  both  OBCM  banks 

81 

//  and  marshall  the  MAP 

data 

in  those  banks  (speed) 

82 

gcm_seg_desc_t  *sdl  ; 

// 

OBCM  segment  descriptor  pointers 

83 

gcm_seg_desc_t  *sd2; 

// 

for  bank  1  and  bank  2 

84 

uint64_t  sdl sz  ; 

// 

segment  sizes 

85 

uint64_t  sd2sz ; 

86 


//  address  of  the  MAP  data  allocated  from  the  2  segments  above 


87 

gcm_addr_t 

kvalA  ; 

// 

OBCM 

address 

of 

CSR  matrix  values 

88 

gcm_addr_t 

kcolA  ; 

// 

OBCM 

address 

of 

CSR  column  indices 

89 

gcm_addr_t 

kptrA  ; 

// 

OBCM 

address 

of 

CSR  row  pointers 

90 

gcm_addr_t 

bA; 

// 

OBCM 

address 

of 

constant  vector 

91 

gcm_addr_t 

D.1A; 

// 

OBCM 

address 

of 

1/ aii  n— vector 

92 

gcm_addr_t 

xA; 

// 

OBCM 

address 

of 

approximate  solution 

93 

94  //for  MAP  allocation 

95  int  nmaps  =  1; 

96  int  mapno  =  0; 

97  #endif  //  #ifdef  HWVER 

98 

99  #ifdef  VURBOZE  //  format  strings  ,  etc.,  when  we  run  in  verbose  mode 


100 

char 

*  f  1  =”\n%5d  : 

%9.2e  %9.2e  %9.2e  %9.2e\n”; 

101 

char 

*  f2  =” 

(%4d,%4d)  (%4d,%4d)  (%4d,%4d) 

(%4d,%4d )  ” ; 

102 

char 

*  f  3  =”  \  n%5d  :%9.2e%9.2e%9.2e%9.2  e%6d%5d%5d%5d  ” ; 

103 

char 

*rl  =” 

val 

col  ” ; 

104 

char 

*kl  =” 

kval 

kcol ” ; 

105 

char 

*r2=”  0 

12  3  0 

1  2  3”; 

1  DR 

char 

-|.  j*  ^ 

1  UO 

107 

char 

^indent  =  ” 

108 

char 

*  spe  ; 

109 

int 

head  =  16; 

//  1  st  few  matrix  values 

110  #endif  //  #ifdef  VURBOZE 

111 

112  tO  =  usec();  //  first  start  time 

113 

114  //  matrix  file  name  is  specified  on  the  command  line 

115  if  (argc  ==  2)  { 

116  strcpy  (fname  ,  argv  [  1  ]  )  ; 

117  }  else  { 

118  fpri  ntf  (  stderr  ,”***  Usage:  ./eg  MM_file\n”); 


and  usage 

//  the  number  of  MAPs  needed 
//  first  MAP  number  is  0 
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119  exit  ( 1 ) ; 

120  } 

121 

122  //  open  the  input  and  result  files 

123  if  ( ( fmm  =  fopen(fname,  ”r”))  ==  NULL)  { 

124  fpri  ntf  (  stderr  ,  ’’Cannot  open  file  %s\n”,  fname); 

125  exit  ( 1 ) ; 

126  } 

127  fprintf  (  stderr  Reading  %s\n”  ,  fname  ) ; 

128  if  ((fres  =  fopen  (’’result”,  ”w”))  ==  NULL)  { 

129  fprintf  (stderr  ,  ’’failed  to  open  file  ’result  ’\n”); 

130  e  x  i  t  ( 1 ) ; 

131  } 

132 

133  //  get  matrix  information  from  MatrixMarket  file 

134  mm_read_banner  (fmm,  &matcode  ) ; 

135  mm_read_mtx_crd_size  (fmm,  &n  ,  &nc  ,  &nnz  ) ; 

136  f  p  ri  n  tf  (  stderr  ,”  %dx%d  %s  with  %d  nonzero  values\n”, 

137  n,nc,mm_typecode_to_str(  mate  ode  )  ,  nnz  ) ; 

138 

139  //  allocate  coordinate  format  arrays  now  that  sizes  are  known 

140  acoo  =  (float  *)  malloc(nnz  *  sizeof  (  float  )) ; 

141  icoo  =  (int  *)  malloc(nnz  *  sizeof  (  int  )) ; 

142  jcoo  =  (int  *)  malloc(nnz  *  sizeof  (  int  )) ; 

143 

144  //  read  MatrixMarket  coordinate —format  matrix  data 

145  for  (i=0;  i<nnz;  i ++)  { 

146  fscanf(fmm,  ”%d  %d  %g\n”,  &icoo  [  i  ]  ,  &jcoo[i],  &acoo[i]); 

147  icoo[i] - ;  //  adjust  from  1— based  to  0— based  for  C  arrays 

148  j coo  [  i ] - ; 

149  } 


150 


f  c  1  o  s  e  ( fmm ) ; 
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151 

152  #ifdef  VURBOZE  //  dump  up  to  head  elements  of  input  matrix 

153  printf(”%s%s  V’COO:  ”,r3); 

154  for  (i=0;  i<MIN(  nnz  ,3  *  head  ) ;  i  +=4)  { 

155  printf  (fl  , 

156  i  ,  acoo  [i]  ,  acoo  [  i  + 1  ]  ,acoo[i+2]  ,acoo[i  +  3]); 

157  printf(f2,icoo[i],jcoo[i],icoo[i+l],jcoo[i+l], 

158  icoo[i+2],jcoo[i+2],icoo[i+3],jcoo[i+3]); 

159  } 

160  printf(”  %s\n”,spc); 

161  printf  (”%s%s\n”  ,  indent  ,  r3  ) ; 

162  # endif  //  #ifdef  VURBOZE 

163 

164  //  allocate  CSR  array  and  convert  to  CSR  format 

165  val  =  (float  *)  malloc(nnz  *  sizeof  (  float  )) ; 

166  col  =  (int  *)  malloc(nnz  *  sizeof  ( int  )) ; 

167  ptr  =  (int  *)  malloc((n  +  l)  *  sizeof  ( i  nt  )) ; 

168  coocsr  (n  ,  nnz  ,  acoo  ,  icoo  ,  jcoo  ,  val  ,  col  ,  ptr  ) ;  //  SPARSKIT  routine 

169 

170  //  have  CSR  data  so  deallocate  MatrixMarket  arrays 

171  free  (  acoo  ) ; 

172  free(icoo); 

173  free(jcoo); 

174 

175  #ifdef  VURBOZE  //  show  the  CSR  format  matrix  data 

176  //  val  ,  and  col  vectors 

177  printf  (”%s%s\n”  ,  indent  ,rl  ); 

178  printf  (”%s%s\n”  ,  indent  ,  r2  ) ; 

179  printf (”%s%s”,”CSR:  ”,r3  ); 

180  for  (i=0;  i<MIN(  nnz  ,3  *  head  ) ;  i +=4)  { 

181  printf  (f3  , 


182 


i  ,val[i]  ,val[i+l],val[i+2],val[i+3], 
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183  col[i],col[i+l],col[i+2],col[i+3]); 

184  } 

185  printf(”  %s\n”,spc); 

186 

187  //  and  ptr  array  (sideways) 

188  printf  (”%s%s\n”  ,  indent  ,  r3  ) ; 

189  printf  (”  i  :  ”) ; 

190  for  (i=0;  i  <=MEN( n  ,  head  ) ;  i ++)  { 

191  printf  (”%5d”  ,  i  ) ; 

192  } 

193  printf(”  %s\n”,spc); 

194  printf  (”  ptr  :  ”) ; 

195  for  (i=0;  i  <=MIN( n  ,  head  ) ;  i  ++)  { 

196  printf  (”%5d”  ,  ptr  [  i  ]) ; 

197  } 

198  printf(”  %s\n”,spc); 

199  printf  (”%s%s\n”  ,  indent  ,  r3  ) ; 

200  # endif  //  #ifdef  VURBOZE 

201 

202  #ifndef  HWVER 

203  //  SWVER:  allocate  solution  and  constant  vector 

204  x  =  (float  *)  malloc  (n  *  sizeof  (  float  )) ; 

205  b  =  (float  *)  malloc  (n  *  sizeof  (  float  )) ; 

206 

207  #  e  1  s  e 

208  //  HWVER:  allocate  and  load  the  matrix  into  cache  — aligned  buffers. 

209  //  as  part  of  software  /  hardware  codesign  tradeoffs  the 

210  //  k— alignment  of  A  will  be  done  here  in  software 

211  //  the  MAP  wants  the  CSR  vectors  to  be  aligned  on  cache 

212  //  boundaries.  We  can’t  place  this  restriction  on  upper— level 

213  //  callers  ,  so  we  will  marshall  the  data  here  into 

214  //  properly  aligned  arrays.  Since  we  have  to  do  the  memcopy 
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216 

217 

218 

219 

220 

221 

222 

223 

224 

225 

226 

227 

228 

229 

230 

231 

232 

233 

234 

235 

236 

237 

238 

239 

240 

241 

242 

243 

244 

245 

246 


//  anyway,  we’ll  just  go  ahead  and  do  the  k— alignment  needed 
//  by  our  MAP  routine  here  in  software. 

//  figure  out  the  growth  in  kval  and  kcol  due  to  the 
//  k— alignment  processing,  i.e.,  calculate  knz 
knz  =  0; 

for  (i=0;  i<n;  i++)  {  //  loop  across  all  matrix  rows 

for  (j  =  ptr[i  ];  j  <ptr  [  i  + 1  ] ;  j ++)  { 

knz++;  //  count  #  non— zeros 

} 

while  ((knz*%K)  !=  0)  {  //  pad  with  0’s  if  necessary 

knz++;  //  to  make  knz  a  multiple  of  K 

} 

} 

//  now  we  have  the  knz  value  ,  so  we  can  calculate  array  sizes 
//  use  SALIGN  macro  (cg.h)  to  avoid  cache  boundary  problems 
kvalsz  =  SALIGN  (knz*sizeof(float)  ,  CACHEALIGN ) ; 
kcolsz  =  SALIGN(knz*sizeof  (  int  ),  CACHEALIGN); 
kptrsz  =  SALIGN((n  +  l)*  sizeof  ( int  )  ,  CACHEALIGN); 

D_1  sz  =  SALIGN(n*  sizeof  (  float  ), CACHEALIGN); 
xsz  =  SALIGN(n*  sizeof  (  float  ),  CACHEALIGN); 
bsz  =  SALIGN(n*  sizeof  (  float  ),  CACHEALIGN); 

//  allocate  arrays  for  the  k— aligned  CSR  matrix 
kval  =  (  float  *)  Cache_Aligned_Allocate  (kvalsz); 

kcol  =  (int*)  Cache_Aligned_Allocate  (kcolsz); 

kptr  =  (int*)  Cache_Aligned_Allocate  (kptrsz); 

D_1  =  (float*)  Cache_Aligned_Allocate  (D_lsz); 

//  also  need  OBCM  bank  segment  descriptors 

//  to  speed  up  MAP  processing,  place  data  in  separate  banks 
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247  //  kval  ,  kptr  ,  and  D_1  go  into  OBCM  bank  1 

248  //  kcol  ,  b,  and  x  go  into  OBCM  bank  2 

249  //  per  SRC7CPEGuide  ( pg  57),  must  start  on  8— byte 

250  //  boundaries  and  be  allocated  in  length  that  is  a  multiple  of  8 

251  //  this  is  already  guaranteed  because  of  our  cache  aligned  sizes 

252  sdlsz  =  (  uint64_t )  kvalsz  +  (  uint64_t )  kptrsz  +  (  uint64_t )  D.lsz  ; 

253  sd2sz  =  (  uint64_t )  kcolsz  +  (  uint64_t )  bsz  +  (  uint64_t )  xsz  ; 

254  if  (  gcm.al loc ate  _seg _by  .bank  (  sd  1  sz  ,  1,  &sdl))  { 

255  f  p  r  i  n  t  f  (  s  t  derr  ,”Bank  1  OBCM  segment  allocation  failed\n”); 

256  e  x  i  t  ( 1 ) ; 

257  } 

258  if  (  gcm.allocate  _seg .by  .bank  (  sd2sz  ,  2,  &sd2))  { 

259  fprintf  (  stderr  ,”Bank  2  OBCM  segment  allocation  failed\n”); 

260  e  x  i  t  ( 1 ) ; 

261  } 

262 

263  //  now  allocate  the  buffer  addresses  in  OBCM  segments 

264  if  (!(kvalA  =  gcm.allocate  .from _seg  (  kvalsz  ,  sd  1  )  ) )  { 

265  fprintf  (  stderr  , "OBCM  allocation  for  kval  failed\n”); 

266  e  x  i  t  ( 1 ) ; 

267  } 

268  if  (!(kcolA  =  gcm.allocate  .from.seg  (  kcolsz  ,  sd2  )) )  { 

269  fprintf  (  stderr  ,”OBCM  allocation  for  kcol  failed\n”); 

270  e  x  i  t  ( 1 ) ; 

271  } 

272  if  ( !  (  kptrA  =  gcm.allocate  .from.seg  (  kptrsz  ,  sd  1  )  ) )  { 

273  fprintf  (  stderr  /’OBCM  allocation  for  kptr  failed\n”); 

274  e  x  i  t  ( 1 ) ; 

275  } 

276  if  (!(bA  =  gc m _al locate  .from _seg  ( bsz  ,  sd2  )) )  { 

277  fprintf  (  stderr  /’OBCM  allocation  for  b  failed\n”); 


278 


exit  ( 1 ) ; 
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279 

} 

280 

if  (!(D_1A  =  gcm.allocate  _from_seg  (  D_lsz  ,  sdl  )))  { 

281 

fprintf  (  stderr  ,”OBCM  allocation 

for  D_1  failed\n”); 

282 

exit  ( 1 ) ; 

283 

} 

284 

if  (!(xA  =  gc  m  _al  locate  _from_seg  ( xsz 

» sd2  ) ) )  { 

285 

fprintf  (  stderr  ,”OBCM  allocation 

for  x  failed \n”); 

286 

exit  ( 1 ) ; 

287 

} 

288 

289 

//  k— align  the  kval  ,  kcol  ,  and  kptr 

arrays  and  compute  D_1 

290 

//  the  meaning  of  kptr  is  slightly  different  than  ptr 

291 

//  recall  ,  MAP  will  be  grabbing  kcol 

and  kval  elements 

292 

//  K  per  clock  cycle  ,  so  kptr  expresses  things  in  groups  of  K 

293 

knz  =  0; 

294 

for  (i=0;  i<n;  i ++)  { 

//  all  matrix  rows 

295 

kptr  [  i  ]  =  knz /K; 

//  k— aligned  ptr  value 

296 

for  (j  =  ptr[i  ];  j  <p  tr  [  i  + 1  ] ;  j ++) 

{ 

297 

kval  [  knz  ]  =  val  [  j  ] ; 

298 

if  ( i ==  c o 1 [ j ] )  { 

//  aii  (diagonal  value) 

299 

D_1  [  i  ]  =  1  /  val  [  j  ]; 

//  calculate  1/aii 

300 

} 

301 

kcol[knz++]  =  col [ j ] ; 

302 

} 

303 

while  ((knz%K)  !=  0)  { 

//  pad  with  0’s 

304 

kval  [knz]  =  0.0; 

//  to  fill  a  group  of  K 

305 

kcol[knz++]  =  0; 

306 

} 

307 

} 

308 

kptr  [  i  ]  =  knz  /K; 

//  similar  to  ptr[n  +  l] 

309 

310  #ifdef  VURBOZE  //  show  the  k— aligned  sizes  and  data 
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311  p ri  n  t  f  (”  words  :  n  ,  knz  =  %d,%d\n ”  , n  ,  knz  ) ; 

312  print  f(”\nbytes:  |kval  |=%d  .  |  kcol  |=%d  .  |  kptr  |=%d  ,  |  vec  |=%d\n  ”  , 

313  kvalsz  ,  kcolsz  ,  kptrsz  ,  xsz  ) ; 

314 

315  //  show  the  k— aligned  kval  .  and  kcol  vectors 

316  prin  t  f  (”  \  n%s%s\n”  , indent  ,kl); 

317  prin tf  (”%  s%s\n”  ,  indent  ,  r2  )  ; 

318  printf  (”%  s%s  ”  ,”kCSR :  ”,r3); 

319  for  (i=0;  i  <MIN(  knz  ,  3  *  head  ) ;  i+=4)  { 

320  printf  (f3  , 

321  i,kval[i],kval[i+l],kval[i+2],kval[i+3], 

322  kcol[i],kcol[i+l],kcol[i+2],kcol[i+3]); 

323  } 

324  printff”  %s\n”,spc); 

325 

326  //  show  the  k— aligned  kptr  array  (sideways) 

327  p  r  i  n  t  f  ( ”  \  n  i  :  ”  )  ; 

328  for  (i=0;  i  <=MIN(  n  ,  head  ) ;  i++)  { 

329  printf  (”%4d”  ,  i  ) ; 

330  } 

331  printf(”  %s\n”,spc); 

332  printf  (”  kptr:  ”); 

333  for  (i=0;  i  <=MIN(  n  ,  head  ) ;  i++)  { 

334  printf  (”%4d”  ,  kptr  [  i  ]) ; 

335  } 

336  printf  (”  %s\n”,spc); 

337  printf  (”  D_1  :  ”); 

338  for  (i=0;  i  <=MIN(  n  ,  head  ) ;  i++)  { 

339  printf  (”%6.2e”  ,D_1  [  i  ])  ; 

340  } 

341  printf  (”  %s\n”,spc); 

342  printf  (”%  s%s\n”  .indent  ,r3); 


343  # endif  //  #ifdef  VURBOZE 

344 

345  //  allocate  solution  and  constant  vector 

346  x  =  (  f  1  o  at  *)  Cache_Aligned_Allocate  (xsz); 

347  b  =  (  f  1  o  at  *)  Cache_Aligned_Allocate  (bsz); 

348 

349  if 

350 

351 

352  } 

353 

354  # endif  //  HWVER 

355 

356  //  calc  constant  vector  b  assuming  xcur  [  i  ]  =  1000.0 

357  //  we’ll  dump  the  b  vector  to  our  result  file 

358  //  also  calculate  2— norm  of  b  vector 

359  b2n_l  =  0.0; 

360  for  (i=0;  i<n;  i ++)  { 

361  b  [  i  ]  =  0.0; 

362  for  (j=ptr  [  i  ] ;  j<ptr[i+l];  j  ++)  {  //  sum(aij*xj) 

363  b[i]+=val[j]*1000.0; 

364  } 

365  b2n_l  +=  b[i]*b[i];  //  sum(b[i]~2) 

366  fprintf  (  fres  ,”b[%d]  =  %12. 10e\n”  ,  i  , b  [  i  ] ) ; 

367  } 

368  b2n_l  =  1/ sqrt  (  b2n_l  ) ;  //  (2  — norm  of  b  vector  )"(  —  1) 

369 

370  //  all  of  that  just  to  get  our  matrix  and  constant  vectors! 

371  //  now  we  do  the  software  or  hardware  version  of  eg 

372 

373  #ifndef  HWVER 


(  map.allocate  ( nmaps  ) )  {  //  allocate  the  MAP 

fpri  ntf  (  stderr  Could  not  allocate  MAP.  Bye!\n”); 
exit  ( 1 ) ; 


374 


//  SWVER:  do  the  eg  5  times  to  get  good  average 
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375 

376 

377 

378 

379 

380 

381 

382 

383 

384 

385 

386 

387 

388 

389 

390 

391 

392 

393 

394 

395 

396 

397 

398 

399 

400 

401 

402 

403 

404 

405 

406 


tl  =  usec();  //  first  stop  time,  second  start  time 

for  (k=0;k<5;k++)  {  //  run  it  5  times  to  get  a  good  average 
//  initialize  the  x  vector  to  all  zeros 
//  basically  we  start  ”  guessing”  at  x[i]  =  0 
for  (i=0;  i<n;  i ++)  { 
x  [  i  ]  =  0.0; 

} 

eg  (  val  ,  col  ,  ptr  ,  b  ,  n  ,  nnz  ,  x  ,  b2n_l  ,&r2nb2n  ,& i  t  e r  s  ) ;  //  solve  for  x 

} 

t2  =  usec();  //  second  stop  time,  third  start  time 

#else 

//  HWVER:  do  the  eg  5  times  to  get  good  average 
tl  =  usec();  //  first  stop  time,  second  start  time 

f or  ( k=0;k<5;k++)  {  //  run  it  5  times  to  get  a  good  average 
//  copy  data  to  the  OBCM  buffers 
gcm_cp_to  ( kvalA  ,  kval  ,  kvalsz); 
gcm_cp_to  ( kcolA  ,  kcol  ,  kcolsz); 
gcm_cp_to  ( kptrA  ,  kptr  ,  kptrsz); 
gcm_cp_to  (bA ,  b,  bsz); 

r2nb2n  =  EPSILON  +1;  //  assume  not  yet  converged 

eg  (kval  A  ,  kcol  A  ,  kptrA  ,bA,  n  ,  knz  ,xA,  b2n_l  ,&r2nb2n  ,  &  iters  ,  mapno  ) ; 

gcm_cp_from  (xA ,  x,  xsz);  //  copy  result  back  from  OBCM  buffer 

} 

t2  =  usee  ();  //  second  stop  time,  third  start  time 

#endif  //  HWVER 

//  save  results 

for  (i=0;  i<n;  i ++)  { 

fprintf  (  fres  ,  ”x[%d]  =  %12. 10e\n”  ,  i  ,  x  [  i  ] ) ; 

} 
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407 

408  //  close  output  file  and  release  memory 

409  fclose(fres  ); 

410 

411  free  (  val  ) ; 

412  free  (col  ) ; 

413  free  (  ptr  ) ; 

414 

415  //  release  all  the  stuff  that  was  allocated 

416  #ifndef  HWVER 

417  free  ( b  ) ; 

418  free(x); 

419  #else 

420  //  deallocate  the  MAP 

421  if  (  map.free  ( nmaps  ) )  { 

422  fpri  ntf  (  stderr  Could  not  deallocate  MAP.  Bye!\n”); 

423  e  x  i  t  ( 1 ) ; 

424  } 

425  //  arrays 

426  Cache  .Aligned  .Free  ((char*)kval); 

427  Cache  .Aligned  .Free  ((char*)kcol); 

428  Cache  .Aligned  .Free  ((char*)kptr); 

429  Cache  .Aligned  .Free  ((  char  *)D_1  ) ; 

430  Cache  .Aligned  .Free  ((char*)b); 

431  Cache  .Aligned  .Free  ((char*)x); 

432  //  OBCM  buffers 

433  gcm.free  ( kvalA  ) ; 

434  gcm.free  ( kcolA  ) ; 

435  gcm.free  ( kptr A  ) ; 

436  gcm.free  (D.1A  ) ; 

437  gcm.free  (bA) ; 


438 


gcm.free  (xA) ; 
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439  //  OBCM  segments 

440  gcm_free_seg  (  sd  1  ) ; 

441  gcm_free_seg  (  sd2  ) ; 

442  #endif  //  HWVER 

443 

444  //  report  to  stdout  and  quit 

445  t3  =  usec();  //  third  stop  time 

446  sprintf  (result  ,”%s  \t  %10d  \t  %6d  \t  %  12. 10  f  \t  %12.10f”, 

447  fname  ,  nnz  ,  iters  ,  r2nb2n  , 

448  (tl  -  tO)  +  ( t2  -  tl  )/5  +  ( t3  -  t2  ) ) ; 

449  puts(result); 

450  if  (  iters  <MAXTTERS)  { 

451  exit  (0) ; 

452  }  else  { 

453  sprintf  (re  suit,”***  Maximum  iterations  (%d)  exceeded  MAX1TERS) ; 

454  puts  (result  ); 

455  e  x  i  t  ( 1 ) ; 

456  } 

457  } 

A.1.2  cg.h 

458  /*  Cg.h 

459  J.  Hawkins,  A.  Alfarra  ,  and  G.  Morris,  Dec  2013 

460  prototype  for  sparse  matrix  conjugate  gradient 

461  */ 

462 

463  #include  <stdio.h> 

464  #include  < s t d  1  i b  . h> 

465 

466  #  d  e  f  i  n  e  HWVER 

467  #ifdef  HWVER 


468 


#include  <libmap.h> 
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469 

#  endif 

470 

471 

//  sometimes  MEN  and 

MAX  macros  already  defined 

472 

#ifndef  MIN 

473 

#define  MIN(x,y) 

(((x)<(y)?(x):(y))) 

474 

#define  MAX(x,y) 

(((x)>(y)?(x):(y))) 

475 

#endif 

476 

477 

//  width  of  the  binary 

tree  ,  maximum  n. 

478 

/ /  maximum  iterations 

> 

convergence  criterion 

479 

#define  K  (8) 

480 

#define  NMAX  (8192) 

481 

#define  MAXTTERS  (100000) 

482 

#define  EPSILON  (le- 

6) 

483 

484 

#  i  f  n  d  e  f  HWVER 

485 

486 

//  SWVER:  the  software  eg  iteration  function 

487 

void  cg( 

488 

float  val  []  , 

// 

CSR  sparse  matrix  values 

489 

int  col  []  , 

// 

CSR  column  indices 

490 

int  ptr []  , 

// 

CSR  row  pointers 

491 

float  b  []  , 

// 

known  constant  vector 

492 

int  n , 

// 

matrix  order 

493 

int  nnz  , 

// 

number  of  non— zeros 

494 

float  x  []  , 

// 

solve  for  x 

495 

float  b2n_l  , 

// 

(2  — norm  of  b  vector  )"(  — 1 ) 

496 

float  *r2nb2n  , 

// 

ratio  of  2— norms  (convergence  test) 

497 

int  *  iters 

// 

iterations  to  solve  or  give  up 

498  ) ; 

499 

500  #  e  1  s  e 
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501 

502  //  HWVER:  the  hardware  eg  iteration  function 

503  void  cg( 

504  gcm_addr_t  kvalA  ,  //  GCM  address  of  CSR  matrix  values 

505  gcm_addr_t  kcolA  ,  //  GCM  address  of  CSR  column  indices 

506  gcm_addr_t  kptrA  ,  //  GCM  address  of  CSR  row  pointers 

507  gcm_addr_t  bA,  //  GCM  address  of  constant  vector 

508  int  n,  //  number  of  matrix  rows  and  columns 

509  int  knz  ,  //  k— aligned  number  of  non— zeros 

510  gcm_addr_t  xA,  //  GCM  address  of  approximate  solution 

511  float  b2n_l  ,  //  (2  — norm  of  b  vector)"(  — 1) 

512  float  *r2nb2n  ,  //  ratio  of  2— norms  (convergence  test) 

513  int  liters  ,  //  iterations  needed  to  converge  (or  give  up) 

514  int  mapnum);  //  MAP  number  required  by  Carte 

515 

516  //  cache  alignment  (%  more  /  proc  /  epuinfo  ) 

517  #define  CACHEALIGN  (128) 

518 

519  //  return  next  integer  multiple  of  sz  >=  x 

520  #  define  SALIGN(x,  sz)  ( ( ( (  x)%(  sz  ))  =  =  0)  ?((  x  ) ) :  ( (  x )  +  (  sz )  -((x)%(  sz  ) ) ) ) 

521 

522  #endif  //  HWVER 

A.1.3  cg.c 


523 

/* 

eg  .  c 

524 

J  .  Hawkins  ,  A.  Alfarra  ,  and 

G.  Morris  ,  Dec  2013 

525 

software  version  of  sparse 

matrix  conjugate  gradient 

526 

*/ 

527 

#include 

<  s  t  d 1 i b  . h> 

528 

#include 

<math  .  h> 

529 

#include 

”cg  .  h” 

530 
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531 

void  cg( 

532 

float  val  []  , 

// 

CSR  sparse  matrix  values 

533 

int  col  []  , 

// 

CSR  column  index 

534 

int  ptr  []  , 

// 

CSR  row  pointer 

535 

float  b  []  , 

// 

known 

constant  vector 

536 

int  n , 

// 

matrix 

order 

537 

int  nnz  , 

// 

number 

of  non— zeros 

538 

float  x  []  , 

// 

solve 

for  x 

539 

float  b2n_l  , 

// 

(2  — norm  of  b  vector  )"(  — 1 ) 

540 

float  *r2nb2n  , 

// 

ratio 

of  2— norms  (convergence  test) 

541 

int  *  iter s  )  { 

// 

iterations  to  solve  or  give  up 

542 

543 

/*  x _ 0  =  0  so  Ax_0  =  0 

,  ergo  vector  Ax  not  needed 

544 

float  *Ax; 

//  for  initial  search 

direction 

545 

float  *p ; 

//  search  direction 

546 

float  *r  ; 

//  residual  vector 

547 

float  * v_ap  ; 

//  avoid  multiple  Ap  mvm 

548 

float  alpha  ; 

//  step  size 

549 

float  rTrold 

; 

//  dot  (  r  (k  — 1) ,  r  (k  — 1)) 

550 

float  rTrnew 

//  dot ( r (k) , r (k)) 

551 

float  beta  ; 

//  projection  operator 

552 

float  normr  ; 

//  2— norm  of  r  ( |  |  r  |  | ) 

553 

float  delta  ; 

//  residual  norm  =  ||r 

l|/||b|| 

554 

int  i  t  e  r  s 1 c  1 

= 

1; 

//  local  iteration  index 

555 

float  sum; 

//  multi-use  accumulator 

556 

int  i  , j ; 

//  index  variables 

557 

558 

/  /  allocate 

vectors 

559 

/*  x_0  =  0  so  Ax_0  =  0 

,  ergo  no  vector  needed 

560 

Ax  -  (float 

*)  malloc  (n*  sizeof  (  f  1  o  at  ) ) ;  */ 

561 

p  =  (float  * 

)malloc(n*sizeof(  float  )); 

562 

r  =  (float  * 

)  malloc  ( n* 

sizeof  (float  ) ) ; 
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563  v_ap  =  (float  *)  malloc  ( n*  s  i  ze  of  (  f  1  o  a  t  ) ) ; 

564 

565  //  calculate  Ax_0 

566  /*  x_0  =  0  so  Ax_0  =  0,  ergo  no  calculation 

567  for  ( i  =0;  i<n  ;  i ++)  { 

568  sum  =  0;  //  recall  A  is  sparse 

569  for  ( j  =  ptr  [  i  ] ;  j<ptr  [  i  +1] ;  j  ++)  { 

570  sum  +=  val[j]*x[col[j]]; 

571  } 

572  Ax[i]  =  sum; 

573  } 

574  */ 

575 

576  //  algorithms  lines  2, 3, 5, and  7 

577  //  recall  Ax_0=0  so  pl  =  r0=b 

578  //  1  st  A— orthogonal  search  direction  (pi) 

579  //  0th  residual  ( rO  =  pi) 

580  //  rTrold  =  dot  ( rO  ,  rO  ) 

581  rTrold  =  0; 

582  for  ( i  =0;  i<n  ;  i ++)  { 

583  x[i]  =  0;  //  want  x_0  =  0 

584  p[i]=b[i];  //  pi  =  b— Ax_0 

585  r[i]  =  p  [  i  ] ;  //  rO  =  pi 

586  rTrold  +=  r[i]*r[i];  //  dot  (  r  ( 0)  ,  r  ( 0 ) ) 

587  } 

588 

589  delta  =  EPSILON  +  1.0;  //  force  loop  entry 

590  //  loop  until  converge  or  max  iters  exceeded 

591  while(  (delta  >EPSILON)&&(  i  t  e  r  s  1  c  1  <MAXTTERS ) )  { 

592 

593  //  alg  line  10:  need  Ap  twice  ,  so  calc  once 

594  for  ( i  =0;  i<n  ;  i ++)  { 
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595 

596 

597 

598 

599 

600 

601 

602 

603 

604 

605 

606 

607 

608 

609 

610 

611 

612 

613 

614 

615 

616 

617 

618 

619 

620 

621 

622 

623 

624 

625 

626 


sum  =  0;  //  recall  A  is  sparse 
f  or  ( j  =  p  t  r  [  i  ] ;  j  <p  tr  [  i  + 1  ] ;  j  ++)  { 
sum  +=  val[j]*p[col[j]]; 

} 

v_ap  [  i  ]  =  sum  ; 

} 

//  alg  11:  calc  step  size  (alpha) 
sum  =  0; 

for  ( i  =0;  i<n  ;  i  ++)  { 

sum  +=  p  [  i  ]  *  v_ap  [  i  ] ; 

} 

alpha  =  rTrold/sum; 

//  alg  12:  calc  next  x  and 

//  alg  16:  residual  vector  and 

//  alg  1  8 :  dot  ( r  ,  r  ) 

rTrnew  =  0; 

for  ( i  =0;  i<n  ;  i  ++)  { 

x [  i  ]  +=  alpha  *p  [  i  ] ; 
r[i]  — =  alpha  *  v_ap  [  i  ] ; 
rTrnew  +=  r[i]*r[i]; 

} 

normr  =  sqrt  ( rTrnew );  //  ||r||,  part  of  alg  22 

beta  =  rTrnew  /  rTrold  ;  //  alg  19:  projection  operator 

rTrold  =  rTrnew;  //  alg  20:  for  next  iteration 

//  alg  21:  calc  next  search  direction 
for  ( i  =0;  i<n  ;  i  ++)  { 

p [ i ]  =  r [ i ]+ beta *p [ i ] ; 
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627 

} 

628 

629 

delta  =  normr  * 

b2n_l  ; 

// 

alg  22:  residual  norm 

630 

i  t  e  r  s  1  c  1  ++; 

// 

next  iteration 

631 

632 

}  //  end  while 

633 

634 

*  iters  =  iterslcl  ; 

// 

report  iterations 

635 

*r2nb2n  =  delta ; 

// 

and  residual  norm 

636 

637 

//  release  vectors 

638 

/*  x _ 0  =  0  so  Ax_0 

-  0,  ergo 

no  Ax 

vector 

639 

f  r  e  e ( Ax ) ;  *  / 

640 

free (p ) ; 

641 

free  (  r  ) ; 

642 

free ( v_ap ) ; 

643 

644 

}  //  end  eg 

A.  1.4  eg. me 


645  //  eg. me 

646  //  sparse  matrix  CG  iterative  solver,  Jerry  Morris  &  Jamory  Hawkins,  Jan  2014 

647  //  partly  based  on  Morris’  April  2011  Jacobi  solver 

648  //  this  is  the  hardware  version  that  runs  on  the  SRC— 7  MAP 

649  // 

650  //  build:  make  debug  (debug  version  :  ./cghw.dbg) 

651  //  make  hw  (hardware  version  :  ./cghw) 

652  // 

653  //  algorithm  cghw  ( kval  ,  kcol  ,  kptr  ,  b  ,  b2  ,  n  ,  knz  ,  x ) 

654  //  //  kval  ,  kcol  ,  kptr  :  CSR—  format  A  matrix 

655  //  //  b:  constant  vector,  b2  =  l/||b|| 

656  //  //  n:  matrix  row  &  cols  ,  knz:  #  non— zeros 
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657  //  //  x:  desired  solution,  i.e.,  Ax=b 

658  // 

659  //  //  place  data  into  OBM  and  BRAM 

660  //  //  only  have  two  GCM  banks  so 

661  //  //  parallel  DMA's  limited  to  2 

662  //  parBegin 

663  //  BUF_DMAGCM1  :OBM  (kval,  stripe  —8) 

664  //  BUF_DMAGCM2 :OBM  (kcol,  stripe  -8) 

665  //  parEnd 

666  //  parBegin 

667  //  STREAMDMAGCMl : BRAM  (  k p t r  ) 

668  //  STREAMDMAGCM2 : BRAM  (b) 

669  //  parEnd 

670  // 

671  //  //  set  up  for  repeat  loop 

672  //  for  i  in  [l,n]  do  //  x0=0,  so  pl:  =  r0:  =  b 

673  II  x [  l  ]  :  =  0 

674  //  p[i]:=r[i]:=b[i] 

675  //  MAC(  r  [  i  ]  ,  r  [  i  ]  ,  rTrold  )  //  calculate  dot(r0,r0) 

676  //  end  for 

677  // 

678  //  d:=0  //  loop  index 

679  // 

680  //  repeat 

681  // 

682  //  //  11  copies  ofp  vector 

683  //  //  to  avoid  loop  — carried  dependence 

684  //  //  8  for  dot— product  tree 

685  //  //  3  elsewhere 

686  //  pl:  =  p2:=  ...  :  =  p  1 1  :  =  p 

687  // 


688  // 


//  compute  v  :=  Ap  via  streaming  accumulator 
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689 

// 

parBegin 

690 

// 

691 

// 

//  dot8s  into  Vfifo  stream 

692 

// 

pi:  for  i  in  [l,knz]  do 

693 

// 

a8  :=  (al  ...  a8 )  stripe —8 

from  kval  [  i  ] 

694 

// 

j8  :=  ( j  1  ...  j  8  )  stripe-8 

from  kcol  [  i  ] 

695 

// 

P8  :  =  Cpl [ j 1 ]  p  8  [ j 8 ] ) 

696 

// 

V fifo  :=  dot  ( a8  ,  p8  ) 

697 

// 

end  for 

698 

// 

699 

// 

//  #  dot8s  per  row  into  Cfifo  stream 

700 

// 

p2 :  for  i  in  [  1  ,  n ]  do 

701 

// 

Cfifo  :=  kptr[i+l]  —  kptr  [  i 

] 

702 

// 

end  for 

703 

// 

704 

// 

//  n  dot  products  into  Sfifo 

705 

// 

p3  :  Sfifo  :=  Stream.Accum  ( Vfifo 

,  Cfifo) 

706 

// 

707 

// 

//  dot [ i  ]=  dot (A [ i ] , p) 

708 

// 

p4 :  for  i  in  [  1  , n ]  do 

709 

// 

v [  i  ]  :=  Sfifo 

710 

// 

MAC(  v  [  i  ]  ,  p9  [  i  ]  ,  pTv)  //  dot(p.v) 

711 

// 

end  for 

712 

// 

713 

// 

parEnd 

714 

// 

715 

// 

alpha  :=  rTrold /pTv  //stepsize 

716 

// 

717 

// 

for  i  in  [  1  , n ]  do 

718 

// 

x  [  i  ]  :=  x[i]  +  alpha*pl0[i]  // 

next  point 

719 

// 

r  [  i  ]  :=  r[i]  alpha*v[i]  // 

residual 

720 

// 

MAC(  r  [  i  ]  ,  r  [  i  ]  ,  rTrnew  )  // 

new  dot  ( r  ,  r  ) 
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721 

// 

722 

// 

723 

// 

724 

// 

725 

// 

726 

// 

727 

// 

728 

// 

729 

// 

730 

// 

731 

// 

732 

// 

733 

// 

734 

// 

735 

// 

736 

// 

737 

// 

738 

// 

739 

// 

740 

// 

741 

// 

742 

// 

743 

// 

744 

// 

745 

// 

746 

// 

747 

// 

748 

// 

749 

// 

750 

// 

751 

// 

752 

// 

end  for 

beta  :=  rTrnew  /  rTrold  //  projection  operator 

rTrold  :=  rTrnew  //  for  next  iteration 

for  i  in  [1,  n]  do  //  new  search  direction 
p [ i ]  :=  r[i]  +  beta*pll[i] 
end  for 

//  residual  norm 

r2b2:=  sqrt  (  rTrold  )*  b2  //  ||r||=  sqrt(rTrold) 

d:  =  d+l  //  next  iteration 

until  (r2b2  <=  epsilon  .OR.  d  >  dmax) 

//  all  done,  stream  x  back  to  GCM 
STREAM J3MABRAM : GCM2  (x) 

end  algorithm 

Derivation  of  residual  norm  termination  condition 

Recall  ,  the  2— norm  of  n— vector  (say)  y  is  given  by 
I  I  y  I  I  =  sqrt  (sum(  i  =1  ,.n)(y  [  i  ]A2)) 

rcur  =  A( xact  —  xcur)  — > 

|  |  xact  —  xcur  |  |  =  j  |  AA(  —  1)*  rcur  |  |  <=  |  |  AA (  —  1)  |  |  *  |  |  rcur  |  |  [5] 

b  =  A*  xact  — >  ||b||  =  |  |  A*  xact  |  |  <=  ||A||*||xact||  — > 

1  /  |  |  xact  |  |  <=  |  |  A  |  |  /  |  |  b  |  |  [6] 
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753  // 

754  //  multiply  [5]  and  [6] 

755  //  |  |  xact  —  xcur  |  |  /  |  |  xact  |  |  <=  ||A||*||A"(—  l)||*(||rcur||/||b||) 

756  // 

757  //  condition  number,  k(A)  =  |  |  A  |  |  *  |  |  A~(  —  1 )  |  |  =  lambda.max  /  lambda.min 

758  //  where  the  lambda  are  max  and  min  eigenvalues  of  matrix  A. 

759  //  In  practice  ,  finding  the  eigenvalues  is  at  least  as  hard  as  solving 

760  //  the  set  of  equations  ,  furthermore  l<=k(A)<  infinity  , 

761  //  so  we  can  simply  use  the  ratio  of  2— norms 

762  //  as  our  termination  ,  i.e.,  we  want 

763  // 

764  //  |  |  rcur  |  |  /  |  |  b  |  |  <  epsilon 

765  // 

766  //  This  is  the  k— aligned  hardware  version.  There  is  an  option  for 

767  //  a  pure  software  version  of  eg.  The  versions  are  selected 

768  //  via  a  variable  that  is  found  in  the  eg .  h  header  file 

769  // 

770  # include  <libmap.h> 

771  #  include  ”cg.h” 

772  void  cg( 

773  gcm_addr_t  kvalA  ,  //  GCM  address  of  CSR  matrix  values 

774  gcm_addr_t  kcolA  ,  //  GCM  address  of  CSR  column  indices 

775  gcm_addr_t  kptrA  ,  //  GCM  address  of  CSR  row  pointers 

776  gcm_addr_t  bA,  //  GCM  address  of  constant  vector 

777  int  n,  //  number  of  matrix  rows  and  columns 

778  int  knz  ,  //  k— aligned  number  of  non— zeros 

779  gcm_addr_t  xA,  //  GCM  address  of  approximate  solution 

780  float  b2n_l  ,  //  (2  — norm  of  b  vector  )"(  —  1) 

781  float  *r2nb2n  ,  //  ratio  of  2— norms  (convergence  test) 

782  int  liters  ,  //  iterations  needed  to  converge  (or  give  up) 

783  int  mapnum)  {  //  MAP  number  required  by  Carte 
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785  //  k— striped  local  copy  of  k— aligned  matrix  values  (kval)  in  OBM 

786  //  2— pack  (0,1) 

787  //  2— pack  (2,3) 

788  //  2-pack  (4,5) 

789  / /  2— pack  (6,7) 

790  OBM_BANK_A(  kvalLOl  ,  int64_t  ,  MAX.OBM.SIZE) 

791  OBM_BANK_B(  kvalL23  ,  int64_t  ,  MAX.OBM.SIZE) 

792  OBM_BANK_C( kvalL45  ,  int64.t  ,  MAX.OBM.SIZE) 

793  OBM_BANK_D(  kvalL67  ,  int64.t  ,  MAX.OBM.SIZE) 

794 

795  //  k— striped  local  copy  of  k— aligned  matrix  columns  (kcol)  in  OBM 

796  //  2— pack  (0,1) 

797  //  2— pack  (2,3) 

798  //  2— pack  (4,5) 

799  //  2— pack  (6  ,7) 

800  OBM_BANK_E(  kcolLOl  ,  int64_t  ,  MAX.OBM.SIZE) 

801  OBM_BANK_F(  kcolL23  ,  int64_t  ,  MAX.OBM.SIZE) 

802  OBM_BANK_G(  kcolL45  ,  int64.t  ,  MAX.OBM.SIZE) 

803  OBM_BANK_H(  kcolL67  ,  int64_t  ,  MAX.OBM.SIZE) 

804 

805  //  all  other  vectors  go  into  BRAM  on  the  Altera  FPGA 

806  //  we  play  some  games  to  balance  use  of  the  BRAM  memory. 

807  //  EP2S180  FPGA  has  768  M4K  RAM,  930  M512  RAM,  and  9  MRAM  blocks  . 

808  //  we  consume  662  of  768  M4K  BRAMs,  81  of  930  M512  BRAMs, 

809  //  and  9  of  9  MRAM  blocks  .  Basically  we  declare  some  arrays 

810  //  bigger  than  they  need  to  be  so  that  Carte  puts  them  in 

811  //  MRAM  rather  than  M4K.  see  SRC— 7  Carte  C  Programming 

812  //  Environment  Guide  for  additional  details. 

813 

814  int  kptrL  [2*NMAX] ;  //  local  copies  of  kptr 

815  float  bL[NMAX];  //  local  copy  of  b 

816  float  r  [NMAX] ;  //  residual  vector 
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817 

float 

v  [NMAX] ; 

// 

matrix  vector  multiply  :  v  =  Ap 

818 

float 

alpha  ; 

// 

step  size 

819 

float 

beta  ; 

// 

projection  operator 

820 

float 

rT  rold ; 

// 

previous  residual  dot(r,r) 

821 

float 

rTrnew  ; 

// 

current  residual  dot(r,r) 

822 

float 

pTv ; 

// 

dot (p , v) 

823 

float 

ri  ; 

// 

avoid  mixed  read /write 

824 

825 

//  the 

dot8  tree  needs 

independent  copies  of  p 

826 

//  to 

avoid  multiple 

—cycle  memory  accesses 

827 

float 

pCO  [NMAX] ; 

//  the  p  copy  (pC*)  are  the 

828 

float 

pCl  [NMAX] ; 

//  p  inputs  for  each  leaf  node 

829 

float 

pC2  [NMAX] ; 

//  of  a  size  k=8  dot  product  unit 

830 

float 

pC3[2*NMAX] ; 

//  the  dot8  is  fully  pipelined 

831 

float 

pC4[2*NMAX] ; 

//  the  array  A  inputs  to  the  dot8 

832 

float 

pC5[2*NMAX] ; 

//  come  from  the  striped  OBM  banks 

833 

float 

pC6[2*NMAX] ; 

834 

float 

pC7[2*NMAX] ; 

835 

836 

float 

pC8  [NMAX] ; 

//  we  need  3  additional  copies  of  p 

837 

float 

pC9  [NMAX] ; 

//  for  other  calculations  that 

838 

float 

pCa  [NMAX] ; 

//  access  the  p  vector 

839 

840 

float 

pnxt  [NMAX] ; 

//  we  need  this  to  calculate  the  next 

841 

float 

xL[2*NMAX] ; 

//  local  copy  for  the  x  calculations 

842 

843 

//  input  vector  stream 

processing  is  as  follows  : 

844 

//  i. 

DMAstream  in 

264- 

-bit  packed  values  from  GCM 

845 

//  2. 

stream  width 

convert  into  32— bit  stream 

846 

//  4. 

unload  single 

32- 

-bit  stream  into  local  BRAM  arrays 

847 

Stream_256  Skptr256  ; 

//  kptr  streams 

848 


Stream_32  Skptr32; 
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849  Stream_256  Sb256 ;  //  b  streams 

850  Stream_32  Sb32  ; 

851 


852 

// 

the 

streaming  accumulator  bears  explanation 

853 

// 

854 

// 

suppose  I  have  3  vectors:  a  =  (l,2,3),  b  =  (4,5),  c  =  (6,7,8,9) 

855 

// 

and 

I  want  the  three  vector  sums:  sa  =  l+2+3=6,  sb=9,  s c  =  3 0 

856 

// 

further  suppose  the  vector  data  is  being  delivered  as 

857 

// 

a  single  stream,  vals  =  1  ,2  ,3  ,4  ,5  ,6  ,7  ,8  ,9 

858 

// 

and 

I  have  another  stream,  cnts  =  3,2,4  (sizes  of  a,  b,  c) 

859 

// 

I  use  the  streaming  accumulator  to  produce 

860 

// 

a  third  stream,  sums  =  6,9,30. 

861 

// 

Here 

’s  an  elided  pseudo— code  of  the  four  parallel  sections 

862 

// 

that 

one  would  use  to  accomplish  this  task 

863 

// 

864 

// 

pars 

{ 

865 

// 

par 

{  //  stream  out  values 

866 

// 

for  i  =1  ,9  { 

867 

// 

put.stream  (&Sva  ,  vals[i]) 

868 

// 

} 

869 

// 

} 

870 

// 

par 

{  //  stream  out  counts 

871 

// 

for  i  =1  ,3  { 

872 

// 

put_stream(&Sca  ,  cnts  [  i  ])) 

873 

// 

} 

874 

// 

par 

{  //  eat  vals  ,  cnts  and  stream  out  sums 

875 

// 

stream_fp_accum_strm_counts_3  2_rr_term(&Sva,&Sca,&Ssa) 

876 

// 

} 

877 

// 

par 

{  //  save  sums 

878 

// 

for  i  =1  ,3  { 

879 

// 

get_stream(&Ssa  ,&sums  [  i  ] ) ; 

880 

// 

} 
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881 

882 

883 

884 

885 

886 

887 

888 

889 

890 

891 

892 

893 

894 

895 

896 

897 

898 

899 

900 

901 

902 

903 

904 

905 

906 

907 

908 

909 

910 

911 

912 


//  } 

// 

//  in  our  case  accumulator  stream  processing  is  as  follows: 

//  1.  compute  dot8  patial  dot  products:  stream  dot8  vals  to  accum 

//  2.  compute  number  of  dot8s  per  row:  stream  cnts  to  accum 

//  3.  accum  sums  dot8s  &  outputs  n  full  dot  products  (dotNs) 

//  4.  complete  processing 
// 

Stream_32  Sva;  //  value  stream  (dot8s)  to  accum 

Stream_32  Sea;  //  count  stream  to  accum  (#  dot8s  per  row) 

Stream_32  Ssa;  //  sums  (dotNs)  from  accum 

//  output  stream  processing 

Stream_32  Sx32 ;  //  the  output  stream  for  x 

Stream_64  Sx ;  //  packed  result  stream 

//  column  numbers  allow  match  of  A[i][j]  with  p  [  j  ] 
int  jO  ,jl  ,j2  ,  j  3  ,j4  ,j5  ,j6  ,  j  7  ; 

//  A  and  p  inputs  to  the  binary  tree  dot  product  unit 
float  aO  ,  al  ,  a2  ,  a3  ,  a4  ,  a5  ,  a6  ,  a7  ; 

float  pO  ,  pi  ,  p2  ,  p3  ,  p4  ,  p5  ,  p6  ,  p7  ; 

//  dot  product  tree  multiplier  node  and  adder  node  outputs 
float  mnO ,  mnl ,  mn2 ,  mn3  ,  mn4 ,  mn5  ,  mn6 ,  mn7  ; 
float  anO  ,  anl  ,  an2  ,  an3  ,  an4  ,  an5  ,  dot8  ; 

//  #  bytes  :  kval  ,  kcol  ,  kptr  ,  b,  and  x 
int  kvalsz  ,kcolsz  ,kptrsz  ,bsz  ,xsz; 

float  r2nb2nL  ;  //  local  copy  of  residual  norm 

int  err;  //  needed  by  the  MAC  IP  cores 


76 


92 


913 

914 

915 

916 

917 

918 

919 

920 

921 

922 

923 

924 

925 

926 

927 

928 

929 

930 

931 

932 

933 

934 

935 

936 

937 

938 

939 

940 

941 

942 

943 

944 


int  itersL  ;  //  number  of  iterations 

i n  t  i  ,  j  ;  //  loops 

//  number  of  bytes  to  be  transferred 

//  use  SALIGN  macro  (cg.h)  to  avoid  cache  boundary  problems 

kvalsz  =  SALIGN  (knz*sizeof(float)  ,  CACHEALIGN ) ; 

kcolsz  =  SALIGN(knz*sizeof  (  int  ),  CACHEALIGN); 

kptrsz  =  SALIGN((n  +  l)*  sizeof(  int  ), CACHEALIGN); 

xsz  =  SALIGN(n*sizeof  (  float  ),  CACHEALIGN); 

bsz  =  SALIGN(n*sizeof  (  float  ),  CACHEALIGN); 

//  grab  kval  and  kcol  (in  different  GCM  banks,  per  main.c) 
#pragma  src  parallel  sections 
{ 

#pragma  src  section 

{  //  stripe  packed  kval  vector  from  GCM  across  four  OBMs 

buffered_dma_gcm  (GCM20BM,  PATH_0 ,  kvalLOl  , 

MAP_OBM_stripe  (1  ,”A,B,C,D”)  ,  kval  A  ,  1  ,  kvalsz  ) ; 

} 

#pragma  src  section 

{  //  stripe  packed  kcol  vector  from  GCM  across  four  OBMs 

buffered_dma_gcm  (GCM20BM,  PATH.l  ,  kcolLOl  , 

MAP_OBM_stripe  (1  ,”E,F,G,H”)  ,  kcolA  ,  1  ,  kcolsz  ); 

}  //  section 

}  //  sections 

//  grab  kptr  and  the  b  vector  using  streams 
//  located  in  different  GCM  banks,  per  main.c 
#pragma  src  parallel  sections 
{ 

#pragma  src  section 

{  //  stream  DMA  in  packed  kptr  vector  from  OBCM 
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945 
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947 
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950 

951 
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953 

954 
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956 
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958 

959 

960 

961 
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963 

964 
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969 
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971 

972 

973 

974 

975 

976 


streamed _dma_gcm_256 (&Skptr256  , PORT_TO_STREAM , 

PATH_0,  kptrA  ,1  ,(int64_t)kptrsz); 
stream_term(&Skptr256  ) ;  //  term  required  by  width  converter 

} 

#pragma  src  section 
{  //  convert  to  32— bit  stream 

stream_width_256to32_term(&Skptr256,&Skptr32); 

} 

#pragma  src  section 

{  //  put  kptr  32— bit  stream  into  BRAM 

int  curr  ;  //  current  kptr  value 
int  i  ,  j  ; 

for  (i=0;  i<kptrsz  /  sizeof  (  int  ) ;  i ++)  {  //  pipeline  depth:  5 
get_stream(&Skptr32,&curr  ); 

//  ordering  is  1  ,0  ,3  ,2  ,5  ,4  . 

j  =(  i  %2)?i  — 1:  i +1 ;  //  fix  the  ordering  problem 

kptrL  [  j  ]  =  curr  ; 

} 

} 

#pragma  src  section 

{  //  stream  DMA  in  packed  b  vector  from  OBCM 

streamed  _dma_gcm_256  (&Sb256  ,  PORT_TO_STREAM , 

PATEL  1  ,bA,  1  ,(  int64_t)bsz); 

stream.term (&Sb256 ) ;  //  term  required  by  width  converter 

} 

#pragma  src  section 
{  //  convert  to  32— bit  stream 

stream_width_25  6to3  2_term(&Sb256  ,&Sb32 ) ; 

} 

#pragma  src  section 

{  //  put  b  vector  32— bit  stream  into  BRAM 


float  curr  ; 
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993 

994 

995 

996 

997 
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int  i  ,  j  ; 

for  (i=0;  i<bsz  /  sizeof  (  float  ) ;  i ++)  {  //  pipeline  depth:  5 
get_stream_flt  (&Sb32,&curr  ); 

//  ordering  is  1  , 0  , 3  , 2  , 5  ,4  , . . . 
j  =(  i  %2)?i  — 1:  i  +1 ; 
bL [ j ]  =  curr ; 

} 

}  //  section 

}  //  sections 

| - 

//  now  we  have  all  the  data  ,  let  ’s  do  the  eg  iteration 
//  use  streaming  accumulator  to  sum  partial  dot  products 
//  we  iterate  till  we  converge  or  exceed  maximum  iterations 

| - 

//  do  the  pre— loop  stuff 
//  x0=0 ,  r0=b ,  pl=b 
//  and  precalculate  dot(r0,r0) 

for  ( i  =0 ;  i<n;  i++)  { 

xL [ i ] =0 ; 
r i =bL [ i  ]; 
pnxt [ i ]=  ri ; 
r  [  i ]  =  ri  ; 

//  use  to  calculate  dot(r,r) 

//  needed  for  convergence  test 

fp_mac_32(ri  ,  ri  ,  (  i==n  —  1 ) ,  1  ,( i  ==0)  ,& rTrold  ,& err  ) ; 

//  A  ~  |  | 

//  r[i]  - +  |  I  I  I  I  I 

//  r  [  i  ]  - +  I  I  I  I  I 

//  terminate?  - +  III  I 

//  accumulate?  - h  | 
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80 


1009 

//  reset  ? 

- H  1  I 

1010 

//  output 

dot  ( r  ,  r  )  <- 

- +  I 

1011 

//  output 

errors  < — 

- h 

1012 

} 

1013 

1014 

r2nb2nL  =  EPSILON+1;  // 

force 

loop  entry 

1015 

for  (  itersL  =0; 

( (  itersL  <  MAXITERS)&&(r2nb2nL  >  EPSILON));  itersL++) 

1016 

1017 

//  copy  pnxt  to  pC* 

for  the  current  iteration 

1018 

for  (i=0; 

i<n;  i++) 

{  // 

pipeline  depth  :  5 

1019 

pCO [ i ] 

-  pnxt [  i  ] 

;  // 

each  leaf  node  multiplier  of  the 

1020 

pCl[i] 

-  pnxt  [  i  ] 

;  // 

binary  tree  needs  a  different 

1021 

pC2 [ i ] 

-  pnxt  [  i  ] 

;  // 

value  from  the  p  vector  . 

1022 

pC3 [ i ] 

-  pnxt  [  i  ] 

;  // 

since  the  BRAMs  are  single —ported  , 

1023 

pC4 [ i ] 

-  pnxt [  i  ] 

;  // 

we  need  8  independent  copies  of  p 

1024 

pC5 [ i ] 

-  pnxt [  i  ] 

;  // 

to  avoid  an  8— cycle  loop  — carried 

1025 

pC6[ i ] 

-  pnxt  [  i  ] 

;  // 

dependence . 

1026 

pC7 [ i ] 

-  pnxt [  i  ] 

1027 

1028 

pC8 [ i ] 

-  pnxt  [  i  ] 

;  // 

to  calculate  dot(p,v) 

1029 

pC9 [ i ] 

-  pnxt  [  i  ] 

;  // 

to  calculate  next  x 

1030 

pCa [ i ] 

=  pnxt [  i  ] 

;  // 

to  calculate  next  search  direction 

1031 

} 

1032 

1033 

//  compute 

v  =  Ap 

1034 

#pragma  src  parallel 

sections 

1035 

{ 

1036 

#pragma  src  section 

1037 

{  //  send  the  dot8 

vals 

to  accumulator 

1038 

int  i  ; 

1039 

1040 

for  ( i 

=0;  i<knz/K:  i++)  {  //  pipeline  depth:  63 
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1043 

1044 

1045 

1046 

1047 

1048 

1049 

1050 

1051 

1052 

1053 

1054 

1055 
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1072 


//  all  k— groups  in  each  row 
//  grab  next  k  matrix  values  (a8  values) 
s  p  1  i  t  _6  4 1  o  3  2  _f  1 1  _ f  1 1  (  kvalLO  1  [  i  ]  ,&al  ,&a0  ) ;  //  2  each 
s  p  1  i  t  _6  4 1  o  3  2  _f  1 1  _ f  1 1  (  kvalL23  [  i  ]  ,&a3  ,&a2  ) ;  //  unpack 
s p  1  i  t  _6 4 1  o  3  2  _f  1 1  _f  1 1  (  kvalL45  [  i  ]  ,&a5  ,&a4  ) ;  //  from 
split_64to32_flt_flt  (  kvalL67  [  i  ]  ,&a7  ,&a6  ) ;  //  OBM 

//  grab  next  k  column  pointers  (j8  columns) 
s p  1  i  t _64 1 o 3  2  ( kcolLO  1  [  i  ]  ,&j  1  ,&j 0  ) ;  //  2  each 

split_64to3 2  ( kcolL23  [  i  ]  ,&j3  ,&j2  ) ;  //  unpack 

split_64to3 2  ( kcolL45  [  i  ]  ,&j5  ,&j4  ) ;  //  from 

split_64to3  2(  kcolL67  [  i  ] , & j 7  ,& j 6  ) ;  //  OBM 


pO 

=  pCO [ j  0 ] ; 

// 

use  the  j8  column  pointers 

pi 

=  pCl  [j  1  ] ; 

// 

to  grab  matching  p’s  (p8) 

p2 

=  pC2 [ j  2 ] ; 

// 

remember  ,  we  have  k 

p3 

=  pC3 [ j  3 ] ; 

// 

copies  of  p 

p4 

=  pC4 [ j  4 ] ; 

// 

to  avoid  memory— based 

p5 

=  pC5 [ j  5 ] ; 

// 

loop  — carried  dependence 

p6 

=  pC6 [ j  6 ]; 

P? 

=  pC7 [ j  7 ] ; 

//  binary  tree  pipeline  for  dot8 
//  recall  ,  this  delivers  a  result 
//  every  clock  cycle  (after  the  latency) 
// 

II  aO  ,  pO — >*mn0 
//  >+an0 

//  al  ,pl — >*mnl  \ 

//  >+an4 

II  a2  ,  p2 — >*mn2  /  \ 

//  >+anl  \ 
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97 


1073 

// 

a3  ,  p3 — >*mn3 

\ 

1074 

// 

>+dot8 — > 

1075 

// 

a4  ,  p4 — >*mn4 

/ 

1076 

// 

>+an2 

/ 

1077 

// 

a5  ,  p5 — >*mn5 

\  / 

1078 

// 

>+an5 

1079 

// 

a6  ,  p6 — >*mn6 

/ 

1080 

// 

>+an3 

1081 

// 

a7  ,  p7 — >*mn7 

1082 

1083 

mnO 

=  aO  *  pO ; 

//  leaf  nodes  (multiplier 

1084 

mnl 

=  al  *  pi  ; 

1085 

mn2 

=  a2  *  p2 ; 

1086 

mn3 

=  a3  *  p3 ; 

1087 

mn4 

=  a4  *  p4 ; 

1088 

mn5 

=  a5  *  p5  ; 

1089 

mn6 

=  a6  *  p6 ; 

1090 

mn7 

=  a7  *  p7  ; 

1091 

1092 

anO 

=  mnO  +  mnl ; 

//  first  adder  stage 

1093 

an  1 

=  mn2  +  mn3 ; 

1094 

an2 

=  mn4  +  mn5 ; 

1095 

an3 

=  mn6  +  mn7 ; 

1096 

1097 

an4 

=  anO  +  anl  ; 

//  second  adder  stage 

1098 

an5 

=  an2  +  an3  ; 

1099 

1100 

dot8  =  an4  +  an5  ; 

//  root  node  output  adder 

1101 

1102 

// 

stream  next  dot8 

to  streaming  accumulator 

1103 

put 

_stream_flt(&Sva  , 

dot8  ,1); 

1104 

98 


1105 
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1120 

1121 
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1123 

1124 

1125 

1126 

1127 

1128 

1129 

1130 

1131 

1132 

1133 

1134 

1135 


}  //  for  i  (next  k— group ) 

stream.term (&Sva ) ;  //  termination  required  by  accum 

} 

#pragma  src  section 

{  //  send  #  of  dot8  ’  s  per  row  to  the  accumulator 

int  i  ; 

int  curr  ,  prev  ;  //  current  &  previous  to  calculate  counts 
int  dot8cnt  ;  //  number  of  dot8s  per  row 


//  note:  kptr  includes  n  +  l’th  item,  also 

//  transfer  curr  to  prev  is  part  of  loop  control 

prev  =  kptrL[0];  //  preload  initial  prev  value 

for  (i  =  l;  i  <=n ;  prev  =  curr  ,  i ++)  {  //  pipeline  depth:  5 

//  calculate  #  of 

curr  =  kptrL  [  i  ] ; 
dot8cnt  =  curr  —  prev ; 
put.stream  (&Sca  ,  dot8cnt  ,1); 

} 

stream.term (&Sca ) ;  //  termination  required  by  accum 


//  partial  dot  products 
//  per  matrix  row ,  send 
//  to  streaming  accum 


} 

#pragma  src  section 
{  //  streaming  accumulator 

stream_fp_accum_strm_counts_3  2_rr_term(&Sva,&Sca,&Ssa); 


//  input  dot  product  vals  (dot8s) - 1-  | 

//  input  #  of  dot8s  for  each  row  - h 

//  output  full  dot  product  stream  (dotN)  < - 

}  //  section 

#pragma  src  section 

{  //  finish  v  =  Ap  calculation  ,  store  in  BRAM  array 

float  dotNi  ;  //  the  i  ’  th  dotN  =  Ai  *  xp 
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1136 


int  i  ; 


99 
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for  (i=0;  i<n;  i ++)  {  //  pipeline  depth:  100 

get_stream_flt(&Ssa,&dotNi  ) ;  //  from  accumulator 

v  [  i  ]=  dotNi  ; 

//  use  a  multiply —accumulate  to  calculate 
//  dot(p,v) 

fp_mac_32  (dotNi  , pC8  [  i  ]  , (  i  ==n  —  1 ) ,  1  ,(  i  ==0) ,&pTv,&  err  ) ; 
//  A  A  .... 

//  V  [  i  ]  - +  |  |  I  I  I  I 

//  p[i]  - +  |  I  I  I  I 

//  terminate?  - H  III  I 

//  accumulate?  - 1  i  i 

//  reset?  - H  | 

//  output  dot(p,v)  < - h 

//  output  errors  < - v 

} 

}  //  section 

}  //  sections 

alpha  =  rTrold/pTv;  //  step  size 

for  ( i  =0;  i<n  ;  i  ++)  { 

xL[i]  +=  alpha  *  pC9 [ i ] ;  //  next  point 


//  we  use  the  scalar  ri  to  avoid  a 
//  mixed  read /write  2— cycle  loop,  i.e., 
//  cannot  update  an  array  and  then 
//  read  back  the  updated  value  in  the 
//  same  clock  cycle 

ri  =  r[i]  —  alpha  *  v  [  i  ] ;  //  residual 

r  [  i ]  =  ri  ; 
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//  use  a  multiply —accumulate  to  calculate  new  dot(r,r) 
fp_mac_32(ri  ,  ri  ,  (  i  ==n  —  1 ) ,  1  ,( i  ==0)  ,& rTrnew  ,& err  ) ; 

//  A  A  |  | 

//  r  |  i  |  - +  |  I  I  I  I  I 

//  r  |  i  |  - +  I  I  I  I  I 

//  terminate?  - +  III  I 

//  accumulate?  - t-  I  i 

//  reset?  - ¥  \  \ 

//  output  dot(r,r)  < - + 

//  output  errors  < - h 

} 

beta  =  rTrnew  /  rTrold  ;  //  projection  operator 

rTrold  =  rTrnew;  //  for  next  iteration 

f or  ( i  =0;  i<n  ;  i ++)  {  //  next  search  direction 

pnxt [ i ]  =  r[i]  +  beta  *  pCa [ i ] ; 

} 

//  ||r||/||b||  convergence  test 

r2nb2nL  =  sqrt(rTrold)  *  b2n_l  ; 

}  //  for  itersL...  (while  not  converged) 

//  converged:  send  the  result  back 
#pragma  src  parallel  sections 
{ 

#pragma  src  section 

{  //  grab  x’s,  pack,  and  stream  them  out 

float  prev  ,  curr  ; 
int64_t  packed ; 
int  i  ; 
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1200 


101 


86 

1201  prev  =  xL  [0] ; 

1202  //  note:  transfer  to  prev  is  part  of  loop  control 

1203  for  (1=1;  i<xsz  /  s  izeof  (  f  1  o  a  t  ) ;  prev  =  curr  ,  i ++)  {  //  pipeline  depth:  4 

1204  curr  =  xL  [  i  ] ; 

1205  if  (i%2)  {  //  odd,  so  combine 

1206  comb.3  2to64_f  1 1 _ f  It  (  curr  ,  prev  ,& packed  ) ; 

1207  } 

1208  put.stream  (&Sx ,  packed  ,(i%2));  //  when  it  is  odd 

1209  } 

1210  }  //  section 

1211  #pragma  src  section 

1212  {  //  DMA  stream  out  x  values  in  packed  stream 

1213  streamed_dma_gcm(&Sx , 

1214  STREAM.TO .PORT ,  PATH.O , xA ,  1  ,  (  in 1 64 _t  )  xsz  ) ; 

1215  }  //  section 

1216  }  //  sections 

1217 

1218  *  iters  =  itersL;  //  iteration  count 

1219  *r2nb2n  =  r2nb2nL  ;  //  residual  norm 

1220  return  ; 

1221  } 

1222 

1223  //  Compilation  statistics  from  nohup .  out 

1224  // 

1225  //  ############################################################## 

1226  //  ###############  PLACE  AND  ROUTE  SUMMARY  ############ 

1227  //  ALUTS:  38,732  /  143,520  (  27  %  ) 

1228  //  Registers:  71,301  /  143,520  (  50  %) 

1229  //  Logic  utilization:  84,634  /  143,520  (  59  %  ) 

1230  //  M512  rams:  81  /  930  (  9  %  ) 

1231  //  M4K  rams:  662  /  768  (  86  %  ) 


1232  //  M-RAMs : 


9  /  9  (  100  %  ) 


102 


1233  //  DSP  blocks:  210  /  768  (  27  %  ) 

1234  //  Fast  model  timing  requirements  succeeded  . 

1235  //  Slow  model  timing  requirements  succeeded  . 

1236  //  ############################################################## 

1237  // 


1238 

// 

Parallel 

region 

starting 

at 

line 

203 

has 

2 

parallel 

sections 

1239 

// 

Parallel 

region 

starting 

at 

line 

224 

has 

6 

parallel 

sections 

1240 

// 

Parallel 

region 

starting 

at 

line 

315 

has 

4 

parallel 

sections 

1241 

// 

Parallel 

region 

starting 

at 

line 

475 

has 

2 

parallel 

sections 

1242  // 

1243  //  ############################################################## 

1244  // 

1245  //  ##################  INNER  LOOP  SUMMARY  ############ 

1246  //  loop  on  line  240: 


1247 

// 

clocks  per  iteration: 

i 

1248 

// 

pipeline  depth  : 

5 

1249 

// 

1250 

// 

loop  on  line  261: 

1251 

// 

clocks  per  iteration: 

1 

1252 

// 

pipeline  depth  : 

5 

1253 

// 

1254 

// 

loop  on  line  279: 

1255 

// 

clocks  per  iteration: 

1 

1256 

// 

pipeline  depth  : 

89 

1257 

// 

1258 

// 

loop  on  line  301: 

1259 

// 

clocks  per  iteration: 

1 

1260 

// 

pipeline  depth  : 

5 

1261 

// 

1262 

// 

loop  on  line  321: 

1263 

// 

clocks  per  iteration: 

1 

1264 

// 

pipeline  depth  : 

65 

103 


88 


1265 

// 

1266 

// 

loop  on  line  398: 

1267 

// 

clocks  per  iteration: 

i 

1268 

// 

pipeline  depth  : 

5 

1269 

// 

1270 

// 

loop  on  line  419: 

1271 

// 

clocks  per  iteration: 

1 

1272 

// 

pipeline  depth  : 

89 

1273 

// 

1274 

// 

loop  on  line  439: 

1275 

// 

clocks  per  iteration: 

1 

1276 

// 

pipeline  depth  : 

110 

1277 

// 

1278 

// 

loop  on  line  466: 

1279 

// 

clocks  per  iteration: 

1 

1280 

// 

pipeline  depth  : 

25 

1281 

// 

1282 

// 

loop  on  line  485: 

1283 

// 

clocks  per  iteration: 

1 

1284 

// 

pipeline  depth  : 

4 

1285 

// 

############################################################## 

A 

.2  ’’Tuned”  Conjugate  Gradient  Development 

A 

.2.1  main2.c 

1286 

/* 

main  .  c 

1287 

sparse  matrix  jacobi  itera 

t  i  v  e  s 

ol  ver 

for 

SRC— 7  hardware 

1288 

Gerald  R.  Morris  ,  gerald  .  r 

.  morri; 

s @us . 

army  . 

mil  ,  1  Mar 

10 

1289 

refactored  by  Gerald  R.  M< 

orris  , 

7  Apr 

2011 

to  use  a 

1290 

residual  norm— based  conver 

gence 

test 

1291 

Refactored  by  Jamory  D.  Hawkins 

and  Anas  Alfarra  ,  Dec 

2013 

1292 

to  use  conjugate  gradient 

as  the 

iter 

a  t  i  o  n 

method 
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1293 

1294  There  are  actually  two  options  when  building  this  code 

1295  The  options  are  selected  via  defines  in  the  mvm.h  header  file 

1296  #define  SWVER  or  #define  HWVER 

1297  The  first  is  the  software  version  which  employs  the  standard 

1298  nested  loop  approach  to  do  the  calculation  serially. 

1299  The  second  is  the  k— aligned  hardware  version  ,  which  employ  loops 

1300  wherein  the  matrix  is  dealt  with  in  chunks  of  k  elements  at  a  time  . 

1301  k— alignment  requires  an  exact  multiple  of  k  elements  in  a  row 

1302  The  k— alignment  process  simply  pads  the  matrix  with  0’s 

1303  */ 

1304  #include  <stdio.h> 

1305  #include  <stdlib.h> 

1306  #include  <sys/time.h> 

1 307  #include  <string.  h> 

1308  #include  <stdint.h> 

1309  # include  <math  .  h> 

1310  #include  ’’mmio.h” 

1311  #ifndef  CG_H 

1312  #include  ”cg.h” 

1313  #endif 

1314  #include  ”  sparsekit  .  h” 

1315  #include  ’’usec.h” 

1316  #ifndef  PARMS_H 

1317  #include  ’’parms.h” 

1318  #endif 

1319 

1320  //  define  this  when  we  want  a  verbose  mode  (mostly  debug  stuff) 

1321  //#  define  VURBOZE 

1322 

1323  int  main  (int  argc  ,  char  *argv[])  { 


1324 
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1325  //  matrices  are  stored  in  MatrixMarket  coordinate  format 

1326  //  we  use  the  MatrixMarket  mmio  (I/O)  routines 


1327 

char  fname[255]; 

// 

1328 

FILE  *fmm; 

// 

1329 

MM.typecode  matcode  ; 

// 

1330 

i  n  t  n  ; 

// 

1331 

i  n  t  nc  ; 

// 

1332 

i n  t  nnz  ; 

// 

1333 

float  *acoo  ; 

// 

1334 

int  *icoo ; 

// 

1335 

int  *jcoo ; 

// 

1336 

1337 

//  eg  uses  compressed 

sparse 

1338 

float  * val  ; 

// 

1339 

int  *  col  ; 

// 

1340 

int  *  ptr ; 

// 

1341 

1342 

float  *x ; 

// 

1343 

float  *b ; 

// 

1344 

float  b2n_l  ; 

// 

1345 

float  r2nb2n  ; 

// 

1346 

1347 

char  result  [1024]; 

// 

1348 

int  iters; 

// 

1349 

FILE  *  fre s  ; 

// 

1350 

1351 

double  tO  ,  tl  ,  t2  ,  t3  ; 

// 

1352 

int  i , j  , k ; 

// 

1353 

1354 

# if def  VURBOZE  //  format 

string 

1355 

char  *fl=”\n%5d:  %9.2e 

1356 

char  *f2=”  (%4d,%4d) 

MatrixMarket  file 
MatrixMarket  file  pointer 
MatrixMarket  meta— data 

#  matrix  rows 

#  matrix  columns 

#  non— zero  elements 
matrix  values 
matrix  row  indices 
matrix  column  indices 

row  format 

CSR  sparse  matrix  (input  matrix) 

CSR  column  index 
CSR  row  pointer 

x  vector 

constant  vector 

(2  — norm  of  b  vector  )"(  — 1 ) 

residual  norm ( convergence  test) 

capture  results 

number  of  iterations  needed 

results  file  pointer 

to  capture  wall  clock  run  time 
loop  indices 

s  ,  etc  . .  when  we  run  in  verbose  mode 
%9.2e  %9.2e  %9.2e\n”; 

(%4d,%4d)  (%4d,%4d)  (%4d,%4d)”; 
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1357 

1358 

1359 

1360 

1361 

1362 

1363 

1364 

1365 

1366 

1367 

1368 

1369 

1370 

1371 

1372 

1373 

1374 

1375 

1376 

1377 

1378 

1379 

1380 

1381 

1382 

1383 

1384 

1385 

1386 

1387 

1388 


char  *f3  =”\n%5d:%9.2e%9.2e%9.2e%9.2e%6d%5d%5d%5d” ; 

char  *rl=”  val  col”; 

char  *r2=”  0  1  2  3  0  1  2  3”; 

char  *r3=” - 

char*indent=” 
char  *  spc 

int  head  =  16;  //  1st  few  matrix  values 

# endif  //  #ifdef  VURBOZE 

tO  =  usec();  //  first  start  time 

//  matrix  file  name  is  specified  on  the  command  line 
if  ( argc  ==  2)  { 

strcpy  ( fname  ,  argv  [  1  ] ) ; 

}  else  { 

fpri  ntf  (  stderr  ,”***  Usage:  ./eg  MM_file\n”); 
exit  ( 1 ) ; 

} 

//  open  the  input  and  result  files 

if  ( ( fmm  =  fopen  ( fname,  ”r”))  ==  NULL)  { 

fpri  ntf  (  stderr  ,  ’’Cannot  open  file  %s\n”,  fname); 
exit  ( 1 ) ; 

} 

fprintf  (  stderr  /’Reading  %s\n”  ,  fname  ) ; 
if  ((fres  =  fopen  (’’result”,  ”w”))  ==  NULL)  { 

fprintf  (stderr  ,  ’’failed  to  open  file  ’result  ’\n”); 
exit  ( 1 ) ; 

} 

//  get  matrix  information  from  MatrixMarket  file 
mm_read_banner  (fmm,  &matcode); 
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92 

1389  mm_read_mtx_crd_size  ( fmm ,  &n  ,  &nc  ,  &nnz  ) ; 

1390  fprintf  ( stderr  %dx%d  %s  with  %d  nonzero  values\n”, 

1391  n,nc,mm_typecode_to_str(  mate  ode  )  ,  nnz  ) ; 

1392 

1393  //  allocate  coordinate  format  arrays  now  that  sizes  are  known 

1394  acoo  =  (float  *)  malloc(nnz  *  sizeof  (  float  )) ; 

1395  icoo  =  (int  *)  malloc(nnz  *  sizeof  ( int  )) ; 

1396  jcoo  =  (int  *)  malloc(nnz  *  sizeof  ( int  )) ; 

1397 

1398  //  read  MatrixMarket  coordinate —format  matrix  data 

1399  for  (i=0;  i<nnz;  i ++)  { 

1400  fscanf(fmm,  ”%d  %d  %g\n”,  &icoo[i],  &jcoo[i],  &acoo[i]); 

1401  icoo[i] ;  //  adjust  from  1— based  to  0— based  for  C  arrays 

1402  jcoo  [  i ] - ; 

1403  } 

1 404  f  c  1  o  s  e  ( fmm ) ; 

1405 

1406  #ifdef  VURBOZE  //  dump  up  to  head  elements  of  input  matrix 

1407  p  r  i  n  t  f  (”%  s%s  ”  ,”COO:  ”,r3); 

1408  for  (i=0;  i<MIN(  nnz  ,3  *  head  ) ;  i +=4)  { 

1409  printf  (fl  , 

1410  i  ,  acoo  [i]  ,acoo[i+l]  ,acoo[i+2]  ,acoo[i  +  3]); 

1411  printf(f2,icoo[i],jcoo[i],icoo[i+l],jcoo[i+l], 

1412  icoo[i+2],jcoo[i+2],icoo[i+3],jcoo[i+3]); 

1413  } 

1414  printf(”  %s\n”,spc); 

1415  printf  (”%s%s\n”  ,  indent  ,  r3  ) ; 

1416  # endif  //  #ifdef  VURBOZE 

1417 

1418  //  allocate  CSR  array  and  convert  to  CSR  format 

1419  val  =  (float  *)  malloc(nnz  *  sizeof  (  float  )) ; 

1420  col  =  (int  *)  malloc(nnz  *  sizeof  (  int  )) ; 


108 


93 

1421  ptr  =  (int  *)  malloc((n  +  l)  *  sizeof  (  int  ) ) ; 

1422  coocsr  (n  ,  nnz  ,  acoo  ,  icoo  ,  jcoo  ,  val  ,  col  ,  ptr  ) ;  //  SPARSKIT  routine 

1423 

1424  //  have  CSR  data  so  deallocate  MatrixMarket  arrays 

1425  free  (  acoo  ) ; 

1426  free  ( icoo  ) ; 

1 427  free  (jcoo); 

1428 

1429  #ifdef  VURBOZE  //  show  the  CSR  format  matrix  data 

1430  //  val  ,  and  col  vectors 

1431  printf (”%s%s\n”  , indent  ,rl); 

1432  printf  (”%s%s\n”  ,  indent  ,  r2  ) ; 

1433  printf  (”%s%s”,”CSR:  ”  ,r3  ); 

1434  for  (i=0;  i<MIN(  nnz  ,3 *  head  ) ;  i +=4)  { 

1435  printf  (f3  , 

1436  i  ,val[i]  ,  val  [  i  +1]  ,  val  [  i  +2]  ,  val  [  i  +  3]  , 

1 437  col  [i  ]  ,col  [i+1]  ,col  [i+2]  ,col  [  i  +  3]); 

1438  } 

1439  printf(”  %s\n”,spc); 

1440 

1441  //  and  ptr  array  (sideways) 

1442  printf  (”%s%s\n”  ,  indent  ,  r3  ) ; 

1443  printf  (”  i  :  ”) ; 

1444  for  (i=0;  i<=MIN(n  ,  head  ) ;  i ++)  { 

1445  printf  (”%5d”  ,  i  ) ; 

1446  } 

1447  printf(”  %s\n”,spc); 

1448  printf(”  ptr:  ”); 

1449  for  (i=0;  i <=MIN( n  ,  head  ) ;  i ++)  { 

1450  printf  (”%5d”  ,  ptr  [  i  ]) ; 

1451  } 

1452  printf(”  %s\n”,spc); 
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1453  printf  (”%s%s\n”  ,  indent  ,  r3  ) ; 

1454  #endif  //  #ifdef  VURBOZE 

1455 

1456  //  allocate  solution  and  constant  vector 

1457  x  =  (float  *)  malloc  (n  *  sizeof  (  float  )) ; 

1458  b  =  (float  *)  malloc  (n  *  sizeof  (  float  )) ; 

1459 

1460  //  calc  constant  vector  b  assuming  x[i]  =  1000.0 

1461  //  we’ll  dump  the  b  vector  to  our  result  file 

1462  //  also  calculate  2— norm  of  b  vector 

1463  b2n_l  =  0.0; 

1464  for  (i=0;  i<n;  i  ++)  { 

1465  b  [  i  ]  =  0.0; 

1466  for  (j  =  ptr  [  i  ] ;  j  <p  t  r  [  i  + 1  ] ;  j  ++)  {  //  sum(aij*xj) 

1467  b[  i  ]+=val  [j  ]*  1000.0; 

1468  } 

1469  b2n_l  +=  b[i]*b[i];  //  sum(b[i]A2) 

1470  fprintf(fres  b[%d]  =  %12.10e\n”,i,b[i]); 

1471  } 

1472  b2n_l  =  1/ sqrt  (  b2n_l  ) ;  //  (2  — norm  of  b  vector  )"(  —  1) 

1473 

1474  //  all  of  that  just  to  get  our  matrix  and  constant  vectors! 

1475  //  now  we  do  the  software  or  hardware  version  of  eg 

1476 

1477  //  do  the  eg  5  times  to  get  good  average 

1478  tl  =  usec();  //  first  stop  time,  second  start  time 

1479  f or  ( k=0;k<5;k++)  {  //  run  it  5  times  to  get  a  good  average 

1480  //  initialize  the  x  vector  to  all  zeros 

1481  //  basically  we  start  ’’guessing”  at  x[i]  =  0 

1482  for  (i=0;  i<n;  i ++)  { 

1483  x [  i  ]  =  0.0; 

1484  } 
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eg  (  val  ,  col  ,  ptr  ,  b  ,  n  ,  nnz  ,  x  ,  b2n_l  ,&r2nb2n  ,& i  t  e r  s  ) ;  //  solve  for  x 

} 

t2  =  usec();  //  second  stop  time,  third  start  time 

//  save  results 

for  (i=0;  i<n;  i ++)  { 

f  pri  ntf  (  fres  ,  ”x[%d]  =  %12. 10e\n”  ,  i  ,  x  [  i  ] ) ; 

} 

//  close  output  file  and  release  memory 

fclose ( fres ) ; 

free  (  val  ) ; 

free  (  col  ) ; 

free  (  ptr  ) ; 

//  report  to  stdout  and  quit 
t3  =  usec();  //  third  stop  time 

sprintf  (result  ,”%s  \t  %10d  \t  %6d  \t  %  12. 10  f  \t  %12.10f”, 
fname  ,  nnz  ,  iters  ,  r2nb2n  , 

(tl  -  tO)  +  (t2  -  tl  )/5  +  ( t3  -  t2  )) ; 
puts  (result); 
if  (  iters  <MAXITERS)  { 
exit  (0) ; 

}  else  { 

sprintf(result,”***  Maximum  iterations  (%d)  exceeded  ...”  ,  MAXITHRS); 
puts  (result  ); 
exit  ( 1 ) ; 

} 

} 

A. 2. 2  cg2.h 


1514  /*  Cg  .  h 
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1515  J  .  Hawkins  ,  A.  Alfarra  ,  and  G.  Morris  ,  Dec  2013 

1516  prototype  for  sparse  matrix  conjugate  gradient 

1517  J  .  Hawkins  and  G.  Morris  ,  Jan  2014 

1518  now  only  matrix  vector  multiply  is  executed  in  hardware 

1519  so  there  are  two  versions  :  software  and  hardware 

1520  */ 

1521  #  i  f  n  d  e  f  CG_H 

1522  #define  CG_H 

1523  #  e  n  d  i  f 

1524 

1 525  //  eg  iteration  function 


1526 

void  cg( 

1527 

float  val  []  , 

// 

CSR  sparse  matrix  values 

1528 

int  col  []  , 

// 

CSR  column  indices 

1529 

int  ptr  []  , 

// 

CSR  row  pointers 

1530 

float  b  []  , 

// 

known  constant  vector 

1531 

int  n , 

// 

matrix  order 

1532 

int  nnz  , 

// 

number  of  non— zeros 

1533 

float  x  []  , 

// 

solve  for  x 

1534 

float  b2n_l  , 

// 

(2  — norm  of  b  vector  )"(  — 1 ) 

1535 

float  *r2nb2n  , 

// 

ratio  of  2— norms  (convergence  test) 

1536 

int  *  iter s 

// 

iterations  to  solve  or  give  up 

1537  ); 

A. 2. 3  Cg2.C 

1538  /*  Cg  .  C 

1539  J  .  Hawkins  ,  A.  Alfarra  ,  and  G.  Morris  ,  Dec  2013 

1540  software  version  of  sparse  matrix  conjugate  gradient 

1541 

1542  refactored  Jan  2014  to  accomodate  matrix  vector  multiply 

1543  */ 

1 544  #include  < s  t  d  1  i  b  .  h> 
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1545 

//  #include  <math 

.  h> 

1546 

#ifndef  CG_H 

1547 

#include  ”cg . 

h” 

1548 

#endif 

1549 

#ifndef  MVMJH 

1550 

#  i  n  c  1  u  d  e  ”mvm .  h  ’ 

’ 

1551 

#endif 

1552 

#ifndef  PARMS_H 

1553 

#include  ”parms 

.h” 

1554 

#endif 

1555 

1556 

void  cg( 

1557 

float  val  []  , 

// 

CSR  sparse  matrix  values 

1558 

int  col  []  , 

// 

CSR  column  index 

1559 

int  ptr  []  , 

// 

CSR  row  pointer 

1560 

float  b  []  , 

// 

known  constant  vector 

1561 

int  n , 

// 

matrix  order 

1562 

int  nnz  , 

// 

number  of  non— zeros 

1563 

float  x  []  , 

// 

solve  for  x 

1564 

float  b2n_l  , 

// 

(2  — norm  of  b  vector  )"(  — 1 ) 

1565 

float  *r2nb2n  , 

// 

residual  norm  (convergence  test) 

1566 

int  liters)  { 

// 

iterations  to  solve  or  give  up 

1567 

1568 

/*  x_0  =  0  so 

Ax_0  =  0  ,  ergo  vector  Ax  not  needed 

1569 

float  *Ax; 

//  for  initial  search  direction 

1570 

1571 

//  p,  v  vectors 

allocated  for  either  software  or  hardware 

1572 

float  *p ; 

//  search  direction 

1573 

float  *v ; 

//  avoid  multiple  Ap  mvm 

1574 

1575 

float  *r  ; 

//  residual  vector 

1576 

float  alpha  ; 

//  step  size 
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1577 

float  rTrold  ; 

//  dot  (  r  (k  — 1) ,  r  (k  — 1)) 

1578 

float  rTrnew  ; 

//  dot  (  r  (k)  ,  r  (k)) 

1579 

float  beta  ; 

//  projection  operator 

1580 

float  normr  ; 

//  2— norm  of  r  ( |  |  r  |  | ) 

1581 

float  delta  ; 

//  residual  norm  =  ||r||/||b|| 

1582 

int  iterslcl  =  1; 

//  local  iteration  index 

1583 

float  sum; 

//  multi-use  accumulator 

1584 

int  i , j ; 

//  index  variables 

1585 

1586 

# if def  HWVER  //  if  it  ’ 

s  the 

hardware  version  ,  we  need  these 

1587 

II  for  k— 

aligned  CSR  data 

1588 

float  *kval  ; 

//  CSR  matrix  k— aligned  values 

1589 

int  * kcol  ; 

//  CSR  k— aligned  column  index 

1590 

int  *  kptr  ; 

//  CSR  k— aligned  row  pointer 

1591 

i n  t  knz ; 

//  k— aligned  #  of  non— zeros 

1592 

int  kvalsz  ; 

//  allocation  size  of  kval  (bytes) 

1593 

int  kcolsz ; 

//  allocation  size  of  kcol 

1594 

int  kptrsz  ; 

//  allocation  size  of  kptr 

1595 

i n t  psz ; 

//  allocation  size  of  vectors 

1596 

i n t  vsz ; 

1597 

int  gotA  =  0; 

//  true  when  A  is  already  on  FPGA 

1598 

//  for  MAP  allocation  and  usage 

1599 

int  nmaps  -  1; 

//  the  number  of  MAPs  needed 

1600 

int  mapnum  =  0; 

//  first  MAP  number  is  0 

1601 

1602 

//  we’ll  allocate  segments  using  both  OBCM  banks 

1603 

//  and  marshall  the 

MAP 

data  in  those  banks  (speed) 

1604 

gcm_seg_desc_t  *sdl 

; 

//  OBCM  segment  descriptor  pointers 

1605 

gcm_seg_desc_t  *sd2 

; 

//  for  bank  1  and  bank  2 

1606 

uint64_t  sdl sz  ; 

//  segment  sizes 

1607 

uint64_t  sd2sz ; 

1608 

//  address  of  the  MAP  data  allocated  from  the  2  segments  above 
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1609 

1610 

1611 

1612 

1613 

1614 

1615 

1616 

1617 

1618 

1619 

1620 

1621 

1622 

1623 

1624 

1625 

1626 

1627 

1628 

1629 

1630 

1631 

1632 

1633 

1634 

1635 

1636 

1637 

1638 

1639 

1640 


gcm_addr_t 

kvalA  ; 

// 

OBCM 

gcm_addr_t 

kcolA  ; 

// 

OBCM 

gcm_addr_t 

kptrA  ; 

// 

OBCM 

gcm_addr_t 

pA; 

// 

OBCM 

gcm_addr_t 

vA; 

// 

OBCM 

address  of  CSR  matrix  values 
address  of  CSR  column  indices 
address  of  CSR  row  pointers 
address  of  current  search  direction 
address  of  v=ap 


//  HWVER:  allocate  and  load  the  matrix  into  cache  — aligned  buffers. 

//  as  part  of  software  /  hardware  codesign  tradeoffs  the 

//  k— alignment  of  A  will  be  done  here  in  software 

//  the  MAP  wants  the  CSR  vectors  to  be  aligned  on  cache 

//  boundaries.  We  can’t  place  this  restriction  on  upper  — level 

//  callers  ,  so  we  will  marshall  the  data  here  into 

//  properly  aligned  arrays.  Since  we  have  to  do  the  memcopy 

//  anyway,  we’ll  just  go  ahead  and  do  the  k— alignment  needed 

//  by  our  MAP  routine  here  in  software. 

//  figure  out  the  growth  in  kval  and  kcol  due  to  the 
//  k— alignment  processing,  i.e.,  calculate  knz 
knz  =  0; 

for  (i=0;  i<n;  i ++)  {  //  loop  across  all  matrix  rows 

for  (j  =  ptr[i  ];  j<ptr[i  +  l];  j ++)  { 

knz+  +  ;  //  count  #  non— zeros 

} 

while  ((knz^K)  !=  0)  {  //  pad  with  0’s  if  necessary 

knz+  +  ;  //  to  make  knz  a  multiple  of  K 

} 

} 

//  now  we  have  the  knz  value  ,  so  we  can  calculate  array  sizes 
//  use  SALIGN  macro  (mvm.h)  to  avoid  cache  boundary  problems 
kvalsz  =  SALIGN  (knz*sizeof(float)  ,  CACHEALIGN ) ; 
kcolsz  =  SALIGN  (knz*sizeof(int),  CACHEALIGN ) ; 


99 


115 


1641 

1642 

1643 

1644 

1645 

1646 

1647 

1648 

1649 

1650 

1651 

1652 

1653 

1654 

1655 

1656 
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kptrsz  =  SALIGN((n  +  l)*  sizeof  ( int  )  ,  CACHE  ALIGN ) ; 
psz  =  SALIGN(n*  sizeof  (  float  )  , C ACHEALIGN ) ; 
vsz  =  SALIGN(n*sizeof  (  float  )  ,  C ACHEALIGN ) ; 

//  allocate  arrays  for  the  k— aligned  CSR  matrix 
kval  =  (  float  *)  Cache_Aligned_Allocate  (kvalsz); 
kcol  =  (int*)  Cache_Aligned_Allocate  (kcolsz); 
kptr  =  (int*)  Cache_Aligned_Allocate  (kptrsz); 

//  allocate  vectors  for  search  direction  and  v=ap  result 
//  these  are  allocated  differently  for  software 
p  =  (  f  1  o  at  *)  Cache_Aligned_Allocate  (psz); 
v  =  (  f  1  o  at  *)  Cache_Aligned_Allocate  (vsz); 

//  also  need  OBCM  bank  segment  descriptors 

//  to  speed  up  MAP  processing  ,  place  data  in  separate  banks 
//  kval  and  kptr  go  into  OBCM  bank  1 
//  kcol,  p,  and  v  go  into  OBCM  bank  2 
//  per  SRC7CPEGuide ,  must  start  on  8— byte 

//  boundaries  and  be  allocated  in  length  that  is  a  multiple  of  8 

//  this  is  already  guaranteed  because  of  our  cache  aligned  sizes 

sdlsz  =  (  uint64_t )  kvalsz  +  (  uint64_t )  kptrsz  ; 

sd2sz  =  ( uint64_t ) kcolsz  +  ( uint64_t ) psz  +  ( uint64_t ) vsz ; 

if  (  gcm.allocate _seg_by  .bank  (  sdlsz  ,  1,  &sdl  ))  { 

f  pri  ntf  (  stderr  ,”Bank  1  OBCM  segment  allocation  failed\n”); 
exit  ( 1 ) ; 

} 

if  (  gcm.allocate  _seg_by  .bank  (  sd2sz  ,  2,  &sd2))  { 

f  pri  ntf  (  stderr  ,”Bank  2  OBCM  segment  allocation  failed\n”); 
exit  ( 1 ) ; 

} 

//  now  allocate  the  buffer  addresses  in  OBCM  segments 
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if  ( !  (  kvalA  =  gcm.allocate  _from_seg  (  kvalsz  ,  sdl  ) ) )  { 

f  pri  ntf  (  stderr  ,”OBCM  allocation  for  kval  failed\n”); 
exit  ( 1 ) ; 

} 

if  ( !  (  kcolA  =  gcm.allocate  _from_seg  (  kcolsz  ,  sd2  ) ) )  { 

fpri  ntf  (  stderr  ,”OBCM  allocation  for  kcol  failed\n”); 
exit  ( 1 ) ; 

} 

if  ( !  (  kptr A  =  gcm.allocate  _from_seg  (  kptrsz  ,  sdl  ) ) )  { 

fpri  ntf  (  stderr  ,”OBCM  allocation  for  kptr  failed\n”); 
exit  ( 1 ) ; 

} 

if  (!(pA  =  gem  .allocate  _from_  seg  ( psz  ,  sd2  )) )  { 

fprintf  (  stderr  ,”OBCM  allocation  for  p  failed\n”); 
exit  ( 1 ) ; 

} 

if  (!(vA  =  gcm.allocate  _from_seg  ( vsz  ,  sd2  ) ) )  { 

fpri  ntf  (  stderr  ,”OBCM  allocation  for  v  failed\n”); 
exit  ( 1 ) ; 

} 

//  k— align  the  kval  ,  kcol  ,  and  kptr  arrays 

//  the  meaning  of  kptr  is  slightly  different  than  ptr 

//  recall  ,  MAP  will  be  grabbing  kcol  and  kval  elements 

//  K  per  clock  cycle  ,  so  kptr  expresses  things  in  groups  of  K 

knz  =  0; 

for  (i=0;  i<n;  i ++)  {  //  all  matrix  rows 

kptr  [  i  ]  =  knz/K;  //  k— aligned  ptr  value 

for  (j  =  ptr[i  ];  j<ptr[i  +  l];  j  ++)  { 
kval  [  knz  ]  =  val  [  j  ] ; 
kcol[knz++]  =  col [ j ] ; 
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1704 


} 
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1705 

while  ((knz%K)  !=  0)  { 

// 

pad  with  0’s 

1706 

kval[knz]  =  0.0; 

// 

to  fill  a  group  of  K 

1707 

kcol[knz++]  =  0; 

// 

since  a  [  i  ]  [  j  ]=0  , 

1708 

} 

// 

the  p  used  is  negligent 

1709 

} 

1710 

kptr  [  i  ]  =  knz  /K; 

// 

similar  to  ptr  [n  +  1] 

1711 

1712  #ifdef  VURBOZE  //  show  the  k— aligned  sizes  and  data 

1713  printf  (”  words  :  n  ,  knz  =  %d,%d\n”  ,n  ,  knz  ) ; 

1714  printf(”\nbytes:  |kval  |=%d  ,  |  kcol  |=%d  ,  |  kptr  |=%d  .  |  vec  |=%d\n”  , 

1715  kvalsz  ,  kcolsz  ,  kptrsz  ,  xsz  ) ; 

1716 

1717  //  show  the  k— aligned  kval  ,  and  kcol  vectors 

1718  printf  (”  \  n%s%s\n”  .indent  ,kl); 

1719  print f  (”% s%s\n”  .  indent  ,  r2  ) ; 

1720  p  r  i  n  t  f  (”%  s%s  ”  ,  ”  kCSR :  ”,r3); 

1721  for  (i=0;  i<MIN(  knz  ,  3  *  head  ) ;  i+=4)  { 

1722  printf  ( f3  , 

1723  i,kval[i],kval[i+l],kval[i+2],kval[i+3], 

1724  kcol  [i]  .kcol  f  i  +  1]  .kcol  f  i+2]  ,  kcol  [  i  +  3]); 

1725  } 

1726  printf  (”  %s\n”,spc); 

1727 

1728  //  show  the  k— aligned  kptr  array  (sideways) 

1729  printf  C'\n  i  :  ”); 

1730  for  (i=0;  i<=MIN(n  ,  head  ) ;  i++)  { 

1731  printf  (”%4d”  ,  i  ) ; 

1732  } 

1733  printf(”  %s\n”,spc); 

1734  printf  (”  kptr  :  ”) ; 

1735  for  (i=0;  i<=MIN(n  ,  head  ) ; 

1736  printf  (”%4d”  ,  kptr  [  i  ])  ; 


i  ++)  { 
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1737  } 

1738  printf(”  %s\n”,spc); 

1739  printf(”  %s\n”,spc); 

1740  printf  (”%s%s\n”  ,  indent  ,  r3  ) ; 

1741  # endif  //  #ifdef  VURBOZE 

1742 

1743  if  ( map.allocate  ( nmaps ) )  {  //  allocate  the  MAP 

1744  fpri  ntf  (  stderr  Could  not  allocate  MAP.  Bye!\n”); 

1745  exit  ( 1 ) ; 

1746  } 

1747 

1748  #else  //  SWVER 

1749 

1750  //  p  and  v  vectors  are  allocated  differently  for  software 

1751  p  =  (float  *)  malloc  (n*  sizeof  (  float  )) ; 

1752  v  =  (float  *)  malloc  (n*  sizeof  (  float  )) ; 

1753  # endif  //  HWVER 

1754 

1755 

1756  //  allocate  our  residual  vector 

1757  r  =  (float  *)  malloc  (n*  sizeof  (  float  )) ; 

1758  //  Ax  =  (float  *)  malloc  (n*  sizeof  (  float  )) ; 

1759 

1760  //  how  to  calculate  Ax_0  if  were  necessary 

1761  /*  x_0  =  0  so  Ax_0  =  0,  ergo  no  calculation 

1762  for  ( i  =0;  i<n  ;  i ++)  { 

1763  sum  =  0;  //  recall  A  is  sparse 

1764  for  (j  =  ptr  [  i  ] ;  j  <p  t  r  [  i  +1];  j  ++)  { 

1765  sum  +=  val[j]*x[col[j]]; 

1766  } 

1767  Ax[i]  =  sum; 

1768  } 
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1769 

*/ 

1770 

1771 

n 

algorithms  lines 

2,3,5,  and  7 

1772 

n 

recall  Ax_0=0  so 

pi  =  r0=b 

1773 

n 

1  st  A— orthogonal 

search  direction  (pi) 

1774 

n 

0th  residual  ( rO 

=  pd 

1775 

n 

rTrold  =  dot ( rO  , 

r0) 

1776  rTrold  =  0; 

1777  for  ( i  =0;  i<n  ;  i ++)  { 


1778 

x [  i  ]  =  0; 

//  want  x_0  =  0 

1779 

P  [  i  ]  =  b  [  i  ] ; 

//  pi  =  b— Ax_0 

1780 

r[i]  =  p[ i  ] ; 

//  rO  =  pi 

1781 

rTrold  +=  r [ i ]*  r [  i  ] 

;  //  dot ( r (0) , r (0) ) 

1782 

} 

1783 

1784 

delta  =  EPSILON  +  1.0; 

//  force  loop  entry 

1785 

// 

loop  until  converge 

or  max  iters  exceeded 

1786 

while)  (delta  >EPSILON)&&(  i  t  e  r  s  1  c  1  <MAXTTERS ) )  { 

1787 

1788 

//  alg  line  10:  need  Ap  twice  ,  so  calc  once 

1789 

#ifndef 

HWVER 

1790 

//  SWVER: 

1791 

mvm(  val  ,  col  ,  ptr  ,  p  ,  n 

>  v ) ; 

1792 

1793 

#else 

1794 

//  HWVER: 

1795 

if  ( !  gotA  )  {  //  only 

copy  A  one  time 

1796 

//  copy  data  to 

the  OBCM  buffers 

1797 

gem _cp_to  (kvalA 

,  kval  ,  kvalsz  ) ; 

1798 

gem _cp_to  (kcolA 

,  kcol ,  kcolsz ) ; 

1799 

gcm_cp_to  ( kptrA 

,  kptr  ,  kptrsz  ) ; 

1800 

} 
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1801  gcm_cp_to  (pA ,  p,  psz);  //  always  need  p 

1802  mvm(kvalA,  kcolA  ,  kptrA  ,  pA,  n,  knz  ,  vA,  gotA  ,  mapnum); 

1803  gotA  =  1; 

1804  gcm_cp_from  (vA ,  v,  vsz);  //  copy  result  back  from  OBCM  buffer 

1805 

1806  #endif  //  HWVER 

1807 

1808  //  alg  11:  calc  step  size  (alpha) 

1809  sum  =  0; 

1810  for  ( i  =0;  i<n  ;  i ++)  { 

1811  sum  +=  p[i]*v[i]; 

1812  } 

1813  alpha  =  rTrold/sum; 

1814 

1815  //  alg  12:  calc  next  x  and 

1816  //  alg  16:  residual  vector  and 

1817  //  alg  1  8 :  dot  ( r  ,  r ) 

1818  rTrnew  =  0; 

1819  for  ( i  =0;  i<n  ;  i ++)  { 

1820  x  [  i  ]  +=  alpha  *p[i]; 

1821  r  [  i  ]  — =  alpha*v[i]; 

1822  rTrnew  +=  r[i]*r[i]; 

1823  } 

1824  normr  =  sqrt  ( rTrnew ) ;  //  ||r||,  part  of  alg  22 

1825 

1826  beta  =  rTrnew  /  rTrold  ;  //  alg  19:  projection  operator 

1827 

1828  rTrold  =  rTrnew;  //  alg  20:  for  next  iteration 

1829 

1830  //  alg  21:  calc  next  search  direction 

1831  for  ( i  =0;  i<n  ;  i ++)  { 


1832 


p [ i ]  =  r [ i ]+ beta *p [ i ] ; 
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1833 

1834 

1835 

1836 

1837 

1838 

1839 

1840 

1841 

1842 

1843 

1844 

1845 

1846 

1847 

1848 

1849 

1850 

1851 

1852 

1853 

1854 

1855 

1856 

1857 

1858 

1859 

1860 

1861 

1862 

1863 

1864 


} 


delta  =  normr  *  b2n_l  ;  //  alg  22:  residual  norm 

iterslcl++;  //  next  iteration 

}  //  end  while 

liters  =  iterslcl;  //  report  iterations 

*r2nb2n  =  delta  ;  //  and  residual  norm 

//  release  vectors 

/*  x_0  =  0  so  Ax_0  =  0,  ergo  no  Ax  vector 
f  r  e  e ( Ax ) ;  *  / 
free  (  r  ) ; 

//  release  all  the  stuff  that  was  allocated 
#ifndef  HWVER 
free  (p ) ; 
free  ( v  ) ; 

#else 

//  deallocate  the  MAP 
if  (  map.free  ( nmaps  ) )  { 

fpri  ntf  (  stderr  , ’’Could  not  deallocate  MAP.  Bye!\n”); 
exit  ( 1 ) ; 

} 

//  arrays 

Cache  .Aligned  .Free  ((char*)kval); 

Cache  .Aligned  .Free  ((char*)kcol); 

Cache  .Aligned  .Free  ((char*)kptr); 

Cache  .Aligned  .Free  ((char*)p); 

Cache  .Aligned  .Free  ((char*)v); 
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//  OBCM  buffers 
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1865  gcm.free  ( kvalA  ) ; 

1866  gcm.free  ( kcolA  ) ; 

1867  gcm.free  (  kptrA  ) ; 

1868  gcm.free  (pA) ; 

1869  gcm.free  (vA) ; 

1870  //  OBCM  segments 

1871  gcm_free_seg  (  sd  1 

1872  gcm_free_seg  ( sd2 

1873  #endif  //  HWVER 

1874 

1875  }  //  end  eg 

A. 2. 4  parms.h 


); 

); 


1876  /*  parms.h 

1877  J.  Hawkins  and  G.  Morris  ,  Jan  2014 

1878  parameters  needed  for  all  dependent  files 

1879  */ 

1880  #  i  f  n  d  e  f  PARMSJf 

1881  #define  PARMSJf 

1882  #  e  n  d  i  f 

1883 

1884  //  sometimes  MIN  and  MAX  macros  already  defined 

1885  #  i  f  n  d  e  f  MIN 

1886  #define  MIN(x,y)  ( ( ( x)  <(y )  ?  ( x  ) :  (  y ) ) ) 

1887  #define  MAX(x,y)  ( ( ( x)  >(y )  ?  ( x  ) :  (  y ) ) ) 

1888  #  e  n  d  i  f 

1889 

1890  //  maximum  iterations  ,  convergence  criterion  ,  max  n 

1891  #define  MAXTTERS  (100000) 

1892  #define  EPSILON  (le-6) 

1893  #define  NMAX  (8192) 
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A. 2. 5  mvm.h 


1 894  /  *  mvm .  h 

1895  J  .  Hawkins  and  G.  Morris  ,  Jan  2014 

1896  now  only  matrix  vector  multiply  is  executed  in  hardware 

1897  so  there  are  two  versions  :  software  and  hardware 

1898  */ 

1899  # ifndef  MVMJH 

1900  #define  MVMJH 

1901  #endif 

1902 

1903  #define  HWVER 

1904  #ifdef  HWVER 

1905  #include  <libmap.h> 

1906  #  e  n  d  i  f 

1907 

1908  //  width  of  the  binary  tree 

1909  #define  K  (8) 

1910 

1911  #ifndef  HWVER 

1912 

1913  //  SWVER:  the  software  mvm  function 

1914  void  mvm( 


1915 

float  val  []  , 

//  CSR  sparse  matrix  values 

1916 

int  col [] , 

//  CSR  column  indices 

1917 

int  ptr  []  , 

//  CSR  row  pointers 

1918 

float  p  []  , 

//  current  search  direction 

1919 

int  n , 

//  matrix  order 

1920 

float  v  [] 

//  v  =  ap 

1921  ); 

1922 

1923  #  e  1  s  e 

1924 


order 
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1925  //  HWVER:  the  hardware  mvm  function 

1926  void  mvm( 


1927 

gcm.addr  _t 

kvalA  , 

// 

GCM  address 

of 

CSR  matrix  values 

1928 

gcm.addr  _t 

kcolA  , 

// 

GCM  address 

of 

CSR  column  indices 

1929 

gcm.addr  _t 

kptrA  , 

// 

GCM  address 

of 

CSR  row  pointers 

1930 

gc  m  _addr  _t 

pA, 

// 

GCM  address 

of 

curr  search  direction 

1931 

int  n , 

// 

number  of  matrix  rows  and  columns 

1932 

int  knz  , 

// 

k— aligned  number  of  non— zeros 

1933 

gc  m  _addr  _t 

vA, 

// 

GCM  address 

of 

v  =  ap 

1934 

int  got  A  , 

// 

true  when  A 

i  s 

already  on  FPGA 

1935  int  mapnum);  //  MAP  number  required  by  Carte 

1936 

1937  //  cache  alignment  (%  more  /  proc  /  cpuinfo  ) 

1938  # define  CACHEALIGN  (128) 

1939 

1940  //  return  next  integer  multiple  of  sz  >=  p 

1941  #  define  SALIGN(p,  sz)  ( ( ( (  p)%(  sz  ))  =  =  0)  ?((  p  ) ) :  ( (  p)  +  (  sz )  -((p)%(  sz  ) ) ) ) 

1942 

1943  # endif  //  HWVER 

A. 2. 6  mvm.c 

1 944  /  *  mvm .  c 

1945  J  .  Hawkins  and  G.  Morris  ,  Jan  2014 

1946  now  only  matrix  vector  multiply  is  executed  in  hardware 

1 947  software  version 

1948  */ 

1949  # ifndef  MVMJH 

1 950  #  i  n  c  1  u  d  e  ”mvm .  h  ” 

1951  #  endif 

1952 

1953  void  mvm( 

1954  float  val  []  ,  //  CSR  sparse  matrix  values 
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1955 

int  col  []  , 

// 

CSR  column  indices 

1956 

int  ptr  []  , 

// 

CSR  row  pointers 

1957 

float  p  []  , 

// 

current  search  direction 

1958 

int  n , 

// 

matrix  order 

1959 

float  v  [])  { 

// 

v  =  ap 

1960 

1961 

float  sum; 

// 

accumulator 

1962 

int  i  ,  j  ; 

// 

index  variables 

1963 

1964 

//  need  Ap 

twice 

,  so  calc  once 

1965 

for  ( i  =0;  i<n  ;  i  ++) 

{ 

1966 

sum  = 

0;  // 

recall  A  is  sparse 

1967 

f°r(j  = 

ptr  [  i  ] 

;  j  <Ptr  [  i  + 1  ] ;  j  ++)  { 

1968 

sum  +=  \ 

ral  [  j  ]*p[  col  [  j  ]] ; 

1969 

} 

1970 

v  [  i  ]  = 

sum ; 

1971 

} 

1972 

}  //  end  mvm 

A. 2. 7  mvm.mc 

1 973  /  /  mvm .  me 

1974  //  sparse  matrix  vector  multiplication  , 

1975  //  Jerry  Morris  &  Jamory  Hawkins,  Jan  2014 

1976  //  partly  based  on  Morris’  April  2011  Jacobi  solver 

1977  //  this  is  the  hardware  version  that  runs  on  the  SRC— 7  MAP 

1978 

1979  #include  <libmap.h> 

1980  #include  ’’mvrn.h” 

1981  #ifndef  PARMS_H 

1982  #include  ’’parms.h” 

1983  #  e  n  d  i  f 


1984 
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1985  void  mvm( 

1986  gcm.addr.t  kvalA  ,  //  GCM  address  of  CSR  matrix  values 

1987  gcm_addr_t  kcolA  ,  //  GCM  address  of  CSR  column  indices 

1988  gcm.addr.t  kptrA  ,  //  GCM  address  of  CSR  row  pointers 

1989  gcm_addr_t  pA,  //  GCM  address  of  curr  search  direction 

1990  int  n,  //  number  of  matrix  rows  and  columns 

1991  int  knz  ,  //  k— aligned  number  of  non— zeros 

1992  gcm.addr.t  vA,  //  GCM  address  of  v  =  ap 

1993  int  gotA  ,  //  true  when  A  is  already  on  FPGA 

1994  int  mapnum)  {  //  MAP  number  required  by  Carte 

1995 

1996  //  k— striped  local  copy  of  k— aligned  matrix  values  (kval)  in  OBM 

1997  //  2— pack  (0,1) 

1998  //  2— pack  (2 ,3) 

1999  //  2— pack  (4  ,5) 

2000  //  2— pack  (6  ,7) 

2001  OBM_BANK_A( kvalLOl  ,  int64_t  .  MAX.OBM.SIZE) 

2002  OBM_BANK_B( kvalL23  ,  int64_t  .  MAX.OBM.SIZE) 

2003  OBM_BANK_C( kvalL45  ,  int64_t  .  MAX.OBM.SIZE) 

2004  OBM_BANK_D( kvalL67  ,  int64_t  .  MAX.OBM.SIZE) 

2005 

2006  //  k— striped  local  copy  of  k— aligned  matrix  columns  (kcol)  in  OBM 

2007  //  2— pack  (0,1) 

2008  //  2— pack  (2,3) 

2009  //  2— pack  (4  ,5) 

2010  //  2— pack  (6  ,7) 

2011  OBM_BANK_E( kcolLOl  ,  int64_t  .  MAX.OBM.SIZE) 

2012  OBM_BANK_F(  kcolL23  ,  int64_t  .  MAX.OBM.SIZE) 

2013  OBM_BANK_G( kcolL45  ,  int64.t  .  MAX.OBM.SIZE) 

2014  OBM_BANK_H( kcolL67  ,  int64.t  .  MAX.OBM.SIZE) 

2015 

2016  //  all  other  vectors  go  into  BRAM  on  the  Altera  FPGA 
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2017  //  we  play  some  games  to  balance  use  of  the  BRAM  memory. 

2018  //  EP2S180  FPGA  has  768  M4K  RAM,  930  M512  RAM,  and  9  MRAM  blocks. 

2019  //  we  consume  662  of  768  M4K  BRAMs,  81  of  930  M512  BRAMs, 

2020  //  and  9  of  9  MRAM  blocks  .  Basically  we  declare  some  arrays 

2021  //  bigger  than  they  need  to  be  so  that  Carte  puts  them  in 

2022  //  MRAM  rather  than  M4K.  see  SRC— 7  Carte  C  Programming 

2023  //  Environment  Guide  for  additional  details. 

2024 

2025  int  kptrL  [2*NMAX] ;  //  local  copies  of  kptr 

2026 

2027  //  the  dot8  tree  needs  independent  copies  of  p 

2028  //  to  avoid  multiple —cycle  memory  accesses 

2029  float  pC0[NMAX];  //  the  p  copy  (pC*)  are  the 

2030  float  pCl  [NMAX] ;  //  p  inputs  for  each  leaf  node 

2031  float  pC2[NMAX];  //  of  a  size  k=8  dot  product  unit 

2032  float  pC3[2*NMAX];  //  the  dot8  is  fully  pipelined 

2033  float  pC4[2*NMAX];  //  the  array  A  inputs  to  the  dot8 

2034  float  pC5[2*NMAX];  //  come  from  the  striped  OBM  banks 

2035  float  pC6[2*NMAX]; 

2036  float  pC7  [2*NMAX] ; 

2037 

2038  //  input  vector  stream  processing  is  as  follows: 

2039  //  1.  DMAstream  in  264— bit  packed  values  from  GCM 

2040  //  2.  stream  width  convert  into  32— bit  stream 

2041  //  4.  unload  single  32— bit  stream  into  local  BRAM  arrays 

2042  Stream_256  Skptr256  ;  //  kptr  streams 

2043  Stream_32  Skptr32; 

2044  Stream_256  Sp256_l  ;  //  p  streams 

2045  Stream_256  Sp256_2;  //  p  streams 

2046  Stream_32  Sp32_l  ;  //  the  streaming  accumulator 

2047  Stream_32  Sp32_2 ;  //  the  streaming  accumulator 


bears  explanation 
bears  explanation 
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2049 

// 

2050 

// 

suppose  I  have  3  vectors:  a  =  (l,2,3),  b  =  (4,5),  c  =  (6,7,8,9) 

2051 

// 

and 

I  want  the  three  vector  sums:  sa  =  l+2+3=6,  sb=9,  s c  =  3 0 

2052 

// 

further  suppose  the  vector  data  is  being  delivered  as 

2053 

// 

a  single  stream,  vals  =  1  ,2  ,3  ,4  ,5  ,6  ,7  ,8  ,9 

2054 

// 

and 

I  have  another  stream,  cnts  =  3,2,4  (sizes  of  a,  b,  c) 

2055 

// 

I  use  the  streaming  accumulator  to  produce 

2056 

// 

a  third  stream,  sums  =  6,9,30. 

2057 

// 

Here 

’s  an  elided  pseudo— code  of  the  four  parallel  sections 

2058 

// 

that 

one  would  use  to  accomplish  this  task 

2059 

// 

2060 

// 

pars 

{ 

2061 

// 

par 

{  //  stream  out  values 

2062 

// 

for  i  =1  ,9  { 

2063 

// 

put.stream  (&Sva  ,  vals[i]) 

2064 

// 

} 

2065 

// 

} 

2066 

// 

par 

{  //  stream  out  counts 

2067 

// 

for  i  =1  ,3  { 

2068 

// 

put_stream(&Sca  ,  cnts  [  i  ])) 

2069 

// 

} 

2070 

// 

par 

{  //  eat  vals  ,  cnts  and  stream  out  sums 

2071 

// 

stream_fp_accum_strm_counts_3  2_rr_term(&Sva,&Sca,&Ssa) 

2072 

// 

} 

2073 

// 

par 

{  //  save  sums 

2074 

// 

for  i  =1  ,3  { 

2075 

// 

get_stream(&Ssa  ,&sums  [  i  ] ) ; 

2076 

// 

} 

2077 

// 

} 

2078 

// 

2079 

// 

in  our  case  accumulator  stream  processing  is  as  follows  : 

2080 


//  1.  compute  dot8  patial  dot  products:  stream  dot8  vals  to  accum 
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2081 

// 

2.  compute  number  of  dot8s  per  row: 

stream 

ents  to  accum 

2082 

// 

3.  accum  sums 

dot8s  &  outputs  n  full 

dot  p 

•  roducts  v  [  i  ] 

2083 

// 

4.  pack  the  v[ 

i]  and  stream  up  to  OBCM 

2084 

// 

2085 

Stream_32  Sva; 

//  value  stream  (dot8 

s  )  to 

accum 

2086 

Stream_32  Sea; 

//  count  stream  to  accum  (# 

dot8s  per  row) 

2087 

Stream_32  Ssa; 

//  sums  (dotNs)=  v[i] 

from 

accum 

2088 

2089 

// 

output  stream  p 

>rocessing 

2090 

Stream_32  Sv32 ;  // 

the  output  stream  for 

V 

2091 

Stream_64  Sv ;  // 

packed  result  stream 

2092 

2093 

// 

column  numbers 

allow  match  of  A[i][j] 

with 

p  [  j  ] 

2094 

int 

jo  ,jl  ,j2  , j 3  ,j4 

» j5  ,  j 6  ,  j 7  ; 

2095 

2096 

// 

A  and  p  inputs 

to  the  binary  tree  dot 

produ 

ict  unit 

2097 

flo 

at  aO  ,  al  ,  a2  ,  a3  , 

a4  ,  a5  ,  a6  ,  a7  ; 

2098 

flo 

at  pO  ,  pi  ,  p2  ,  p3  , 

p4  ,  p5  ,  p6  ,  p7  ; 

2099 

2100 

// 

dot  product  tre 

e  multiplier  node  and 

adder 

node  outputs 

2101 

flo 

at  mn0,mnl,mn2, 

mn3  ,  mn4 ,  mn5  ,  mn6 ,  mn7  ; 

2102 

flo 

at  anO  ,  anl  ,  an2  , 

an3  ,  an4  ,  an5  ,  dot8  ; 

2103 

2104 

// 

#  bytes  :  kval  , 

kcol  ,  kptr  ,  p ,  and  v 

2105 

int 

kvalsz  ,  kcolsz  , 

kptrsz  ,  psz  ,  vsz  ; 

2106 

2107 

int 

i.j;  // 

loops 

2108 

2109 

// 

number  of  bytes 

to  be  transferred 

2110 

// 

use  SALIGN  macro  (mvm.h)  to  avoid  cac 

he  bou 

indary  problems 

2111 

psz 

=  SALIGN  ( n  *  s  i  z 

eof  (  float  )  .CACHEALIGN) 

’ 

2112 

vsz 

=  SALIGN  ( n  *  s  i  z 

eof  (  float  )  .CACHEALIGN) 

; 
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2113 

2114 

2115 

2116 

2117 

2118 

2119 

2120 

2121 

2122 

2123 

2124 

2125 

2126 

2127 

2128 

2129 

2130 

2131 

2132 

2133 

2134 

2135 

2136 

2137 

2138 

2139 

2140 

2141 

2142 

2143 

2144 


i  f  ( !  gotA )  {  //  only  need  to  bring  in  A  one  time 

kvalsz  =  SALIGN ( knz  *  sizeof  (  float  )  , CACHEALIGN ) ; 
kcolsz  =  SALIGN  ( knz  *  s i z  e  o  f (  i  n  t  )  ,  CACHEALIGN ) ; 
kptrsz  =  SALIGN((n  +  l)*  sizeof  ( int  ),  CACHEALIGN); 

//  grab  kval  and  kcol  (in  different  GCM  banks,  per  cg.c) 

#pragma  src  parallel  sections 

{ 

#pragma  src  section 

{  //  stripe  packed  kval  vector  from  GCM  across  four  OBMs 

buffered_dma_gcm(GCM2OBM,PATH_0,kvalL01  , 

MAP_OBM_stripe  (1  ,”A,B,C,D”)  ,  kval  A  ,  1  ,  kvalsz  ) ; 

} 

#pragma  src  section 

{  //  stripe  packed  kcol  vector  from  GCM  across  four  OBMs 

buffered_dma_gcm  (GCM20BM,  PATH.l  ,  kcolLOl  , 

MAP_OBM_stripe  ( 1  ,”E,F,G,H”)  ,  kcolA  ,1  ,  kcolsz  ); 

}  //  section 

}  //  sections 

//  grab  kptr  and  p  using  streams 
//  located  in  the  GCM  bank,  per  cg.c 
#pragma  src  parallel  sections 
{ 

#pragma  src  section 

{  //  stream  DMA  in  packed  kptr  vector  from  OBCM 

streamed _dma_gcm_256(&Skptr256  , PORT_TO_STREAM , 

PATH.O  ,  kptr  A  ,1  ,(int64_t)kptrsz); 
stream.term (&Skptr256  ) ;  //  term  required  by  width  converter 


#pragma  src  section 
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2145 

2146 

2147 

2148 

2149 

2150 

2151 

2152 

2153 

2154 

2155 

2156 

2157 

2158 

2159 

2160 

2161 

2162 

2163 

2164 

2165 

2166 

2167 

2168 

2169 

2170 

2171 

2172 

2173 

2174 

2175 

2176 


{  //  convert  to  32— bit  stream 

stream_width_25  6to3  2_term(&Skptr256  ,&Skptr32); 

} 

#pragma  src  section 

{  //  put  kptr  32— bit  stream  into  BRAM 

int  curr  ;  //  current  kptr  value 
int  i  ,  j  ; 

for  (i=0;  i<kptrsz  /  sizeof  (  int  ) ;  i++)  {  //  pipeline  depth:  5 
get_stream(&Skptr32  ,&curr  ) ; 

//  ordering  is  1  ,0  ,3  ,2  ,5  ,4  , . . . 
j  =(  i  %2)?i  — 1:  i +1 ;  //  fix  the  ordering  problem 

kptrL  [  j  ]  =  curr  ; 

} 

}  //  section 

#pragma  src  section 

{  //  stream  DMA  in  packed  p  vector  from  OBCM 

streamed  _dma_gcm_256 (&Sp256_l  ,  PORT  JTOJSTREAM , 

PATH.l  ,pA,  1  ,(  int64_t  )psz); 

stream.term (&Sp256_l  ) ;  //  term  required  by  width  converter 

} 

#pragma  src  section 
{  //  convert  to  32— bit  stream 

stream_width_256to32_term(&Sp256_l  ,&Sp32_l  ); 

} 

#pragma  src  section 

{  //  put  p  vector  32— bit  stream  into  BRAM 

float  curr  ; 
int  i  ,  j  ; 

for  (i=0;  i<psz  /  sizeof  (  fl  o  at  ) ;  i ++)  {  //  pipeline  depth:  5 
get_stream_flt(&Sp32_l  ,&curr  ); 

//  ordering  is  1  ,0  ,3  ,2  ,5  ,4  , . . . 
j  =(  i  %2)?i  — 1:  i  + 1 ; 


116 


132 


2177 

2178 

2179 

2180 

2181 

2182 

2183 

2184 

2185 

2186 

2187 

2188 

2189 

2190 

2191 

2192 

2193 

2194 

2195 

2196 

2197 

2198 

2199 

2200 

2201 

2202 

2203 

2204 

2205 

2206 

2207 

2208 


//  remember  we  need  8  copies  of  p  for  the  dot  product  unit 

pCO  [  j  ]  =  curr  ; 

pCl [ j ]  =  curr ; 

pC2 [ j ]  =  curr ; 

pC3  [j  ]  =  curr  ; 

pC4 [ j ]  =  curr ; 

pC5 [ j ]  =  curr ; 

pC6  [  j  ]  =  curr  ; 

pC7 [ j ]  =  curr ; 

} 

}  //  section 
}  //  sections 


}  else 

{ 

//  only  need  p 

// 

grab  p 

using  streams 

// 

located 

in  the  GCM  bank,  per  eg .  c 

#pragma  src  parallel  sections 

{ 

#pragma  src  section 

{  //  stream  DMA  in  packed  p  vector  from  OBCM 

streamed  _dma_gcm_256 (&Sp256_2  , PORT.TO JSTREAM , 

PATH.l  , pA,  1  ,(  int64_t  )psz  ); 

stream.term (&Sp256_2 ) ;  //  term  required  by  width  converter 

} 

#pragma  src  section 
{  //  convert  to  32— bit  stream 

stream_width_25  6to3  2_term(&Sp256_2  ,&Sp32_2  ) ; 

} 

#pragma  src  section 

{  //  put  p  vector  32— bit  stream  into  BRAM 

float  curr  ; 
int  i  ,  j  ; 
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2209 

2210 

2211 

2212 

2213 

2214 

2215 

2216 

2217 

2218 

2219 

2220 

2221 

2222 

2223 

2224 

2225 

2226 

2227 

2228 

2229 

2230 

2231 

2232 

2233 

2234 

2235 

2236 

2237 

2238 

2239 


for  (i=0;  i<psz  /  si  zeof  (  f  1  o  at  ) ;  i ++)  {  //  pipeline  depth:  5 
get_stream_flt(&Sp32_2  ,&curr  ) ; 

//  ordering  is  1  ,0  ,3  ,2  ,5  ,4  , . . . 
j  =(  i  %2)?i  — 1:  i  + 1 ; 

//  remember  we  need  8  copies  of  p  for  the  dot  product  unit 

pCO  [  j  ]  =  curr  ; 

pCl  [  j  ]  =  curr  ; 

pC2  [  j  ]  =  curr  ; 

pC3  [  j  ]  =  curr  ; 

pC4  [  j  ]  =  curr  ; 

pC5  [  j  ]  =  curr  ; 

pC6  [  j  ]  =  curr  ; 

pC7  [  j  ]  =  curr  ; 

} 

}  //  section 
}  //  sections 

} 

//  compute  v  =  Ap 
#pragma  src  parallel  sections 
{ 

#pragma  src  section 

{  //  send  the  dot8  vals  to  accumulator 

int  i  ; 

for  ( i  =0 ;  i<knz/K;  i ++)  {  //  pipeline  depth:  63 

//  all  k— groups  in  each  row 
//  grab  next  k  matrix  values  (a8  values) 
s  p  1  i  t  _  6  4 1  o  3  2  _f  1 1  _f  1 1  (  kvalLO  1  [  i  ]  ,&al  ,&a0  ) ;  //  2  each 
s  p  1  i  t  _  6  4 1  o  3  2  _f  1 1  _f  1 1  ( kvalL23  [  i  ]  ,&a3  ,&a2  ) ;  //  unpack 
s  p  1  i  t  _  6  4 1  o  3  2  _f  1 1  _f  1 1  (  kvalL45  [  i  ]  ,&a5  ,&a4  ) ;  //  from 
split_64to32_flt_flt(kvalL67[i],&a7,&a6);  //  OBM 


118 


2240 


134 


2241 

// 

grab  next  k  column 

pointers  (j8  columns) 

2242 

sp lit_64to 3 2  ( kcolLO  1  [ 

i],&jl,&j0);  //  2  each 

2243 

split_64to32  (kcolL23  [ 

i],&j3,&j2);  //  unpack 

2244 

split_64to32  (kcolL45  [ 

i],&j5,&j4);  //  from 

2245 

split_64to32  (kcolL67  [ 

i  ]  ,&j7  ,&j6  ) ;  //  OBM 

2246 

2247 

pO 

=  pCO [ jO ] ; 

//  use  the  j8  column  pointers 

2248 

pi 

=  pCl [ j 1 ]; 

//  to  grab  matching  p’s  (p8) 

2249 

P2 

=  pC2 [ j  2 ] ; 

//  remember ,  we  have  k 

2250 

p3 

=  pC3 [ j  3 ] ; 

//  copies  of  p 

2251 

p4 

=  pC4 [ j  4 ] ; 

//  to  avoid  memory— based 

2252 

p5 

=  pC5[j5 ]; 

//  loop  — carried  dependence 

2253 

p6 

=  pC6 [ j  6 ]; 

2254 

P? 

=  pC7 [ j  7 ] ; 

2255 

2256 

// 

binary  tree  pipeline  for  dot8 

2257 

// 

recall  ,  this  delivers  a  result 

2258 

// 

every  clock  cycle 

(after  the  latency) 

2259 

// 

2260 

// 

aO  ,  pO — >*mnO 

2261 

// 

>+anO 

2262 

// 

al  , pi — >*mnl  \ 

2263 

// 

>+an4 

2264 

// 

a2  ,  p2 — >*mn2  / 

\ 

2265 

// 

>+anl 

\ 

2266 

// 

a3  ,  p3 — >*mn3 

\ 

2267 

// 

>+dot8 — ■> 

2268 

// 

a4  ,  p4 — >*mn4 

/ 

2269 

// 

>+an2 

/ 

2270 

// 

a5  ,  p5 — >*mn5  \ 

/ 

2271 

// 

>+an5 

2272 

// 

a6  ,  p6 — >*mn6  / 
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2273 

// 

>+an3 

2274 

II  a7  ,  p7 — >*mn7 

2275 

2276 

mnO 

=  aO  *  pO  ;  // 

leaf  nodes  (multiplier  stage) 

2277 

mnl 

=  al  *  pi  ; 

2278 

mn2 

=  a2  *  p2 ; 

2279 

mn3 

=  a3  *  p3  ; 

2280 

mn4 

=  a4  *  p4 ; 

2281 

mn5 

=  a5  *  p5  ; 

2282 

mn6 

=  a6  *  p6 ; 

2283 

mn7 

=  a7  *  p7 ; 

2284 

2285 

anO 

=  mnO  +  mnl ;  /  / 

first  adder  stage 

2286 

anl 

=  mn2  +  mn3 ; 

2287 

an2 

=  mn4  +  mn5 ; 

2288 

an3 

=  mn6  +  mn7 ; 

2289 

2290 

an4 

=  anO  +  anl  ;  // 

second  adder  stage 

2291 

an5 

=  an2  +  an3  ; 

2292 

2293 

dot8 

=  an4  +  an5  ;  // 

root  node  output  adder 

2294 

2295 

//  stream  next  dot8  to  streaming  accumulator 

2296 

put. 

stream  .fit  (&Sva  ,  dot  8 

.1); 

2297 

2298 

}  //  for 

i  ( next  k— group ) 

2299 

stream.term (&Sva ) ;  //  termination  required  by  accum 

2300 

} 

2301 

#pragma 

src 

section 

2302 

{  // 

send 

#  of  dot8  ’  s  per  row 

to  the  accumulator 

2303 

int 

i  ; 

2304 

int 

curr 

,  prev ;  //  current  & 

previous  to  calculate  counts 
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2305 

2306 

2307 

2308 

2309 

2310 

2311 

2312 

2313 

2314 

2315 

2316 

2317 

2318 

2319 

2320 

2321 

2322 

2323 

2324 

2325 

2326 

2327 

2328 

2329 

2330 

2331 

2332 

2333 

2334 

2335 

2336 


int  dot8cnt  ;  //  number  of  dot8s  per  row 

//  note:  kptr  includes  n  +  l’th  item,  also 

//  transfer  curr  to  prev  is  part  of  loop  control 

prev  =  kptrL[0];  //  preload  initial  prev  value 

for  (i=l;  i<=n;  prev  =  curr  ,  i ++)  {  //  pipeline  depth:  5 

//  calculate  #  of 

curr  =  kptrL  [  i  ] ;  //  partial  dot  products 

dot8cnt  =  curr  —  prev  ;  //  per  matrix  row  ,  send 

put.stream  (&Sca  ,  dot8cnt  ,1);  //  to  streaming  accum 

} 

stream.term (&Sca ) ;  //  termination  required  by  accum 

} 

#pragma  src  section 
{  //  streaming  accumulator 

stream_fp_accum_strm_counts_3  2_rr_term(&Sva,&Sca,&Ssa); 

//  I 

//  input  dot  product  vals  (  d o 1 8 s  )  - +  |  [ 

//  input  #  of  dot8s  for  each  row  - + 

//  output  full  dot  product  stream  (dotN)  < - h 

}  //  section 

#pragma  src  section 

{  //  finish  v=  Ap  calculation  ,  store  in  OBCM 

float  prev  ,  curr  ;  //  last  two  v[i]’s 

int64_t  packed ; 
int  i  ; 

for  (i=0;  i<n;  prev  =  curr  ,  i ++)  {  //  pipeline  depth:  100 

get_stream_f  It  (&Ssa  ,& curr  ) ;  //  from  accumulator 

if  (i%2)  {  //  odd,  so  combine 

comb_32to64_flt_flt(curr  ,prev,&packed  ); 

} 
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122 

2337 

2338 

2339 

2340 

2341 

2342 

2343 

2344 

2345 

2346 

2347  } 

A. 3  Performance  Evaluation  Application 

A. 3.1  usec.h 

2348  //  usee  .  h 

2349  //  need  usee  time  resolution  to  measure  performance 

2350  double  usee  (void); 

A. 3. 2  usec.c 

2351  #include  <sys/time.h> 

2352  #include  <stdlib.h> 

2353  // - 

2354  //  need  microsecond  resolution  to  measure  performance 

2355  double  usee  (void)  { 

2356 

2357  struct  timeval  tO  ;  //  time 

2358 

2359  gettimeofday  (&t0  ,NULL) ; 

2360 


put.stream  (&Sv  ,  packed  ,(  i  %2));  //  when  it  is  odd 

} 

}  //  section 

//  matrix  vector  multiply  complete:  send  result  back 
#pragma  sre  section 

{  //  DMA  stream  out  v  values  in  packed  stream 

streamed_dma_gcm(&Sv , 

STREAM.TOJPORT , PATH.O  , vA ,  1  ,(  int64_t )  vsz  ) ; 

}  //  section 

}  //  sections 


2361 


//  if  we  overrun  the  epoch,  boom! 


138 


2362 

2363 

2364  } 
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//  we’ll  assume  this  seldom  happens!! 
return  ((double  )(t0  .  tv_sec*le6  +  t0  .  tv.usec  )); 
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